Design of Very Deep Pipelined Multipliers for FPGAs
Alex Panato, Sandro Silva, Flávio Wagner, Marcelo Johann, Ricardo Reis, Sergio Bampi
Universidade Federal do Rio Grande do Sul - Instituto de Informática
Av Bento Gonçalves, 9500, Bloco IV, Porto Alegre, RS, Brazil
e-mail:
@inf.ufrgs.br
Abstract
This work investigates the use of very deep pipelines for
implementing circuits in FPGAs, where each pipeline
stage is limited to a single FPGA logic element (LE). The
architecture and VHDL design of a parameterized integer
array multiplier is presented and also an IEEE 754
compliant 32-bit floating-point multiplier. We show how to
write VHDL cells that implement such approach, and how
the array multiplier architecture was adapted. Synthesis
and simulation were performed for Altera Apex20KE
devices, although the VHDL code should be portable to
other devices. For this family, a 16 bit integer multiplier
achieves a frequency of 266MHz, while the floating point
unit reaches 235MHz, performing 235 MFLOPS in an
FPGA. Additional cells are inserted to synchronize data,
what imposes significant area penalties. This and other
considerations to apply the technique in real designs are
also addressed.
1. Introduction
Pipelines are widely used to improve the performance
of digital circuits, since they provide a simple way of
implementing parallelism from streams of sequential
operations. As more stages are inserted in the pipeline,
each stage becomes shorter, and ideally presents a smaller
delay. So, the resulting circuit will exhibit bigger latency
but higher sustained performance when the pipeline is
fully utilized.
Theoretically, we can push the pipeline depth to a level
of using a single gate between two registers. But usually,
there is a compromise between performance improvements
obtained with increased pipeline depth and the penalties
imposed by the additional memory elements inserted in
between the stages.
In [1] it is presented an 8-bit, full custom, integer
multiplier using pipeline stages of a single half adder. [2]
uses the same methodology to implement a two’s
complement multiplier. Besides the high throughput
achieved, the techniques need very complex and manual
work, since they employ full custom design, and work
only to specific technologies and bit widths, not being
accessible to regular ASIC designs.
In this work we investigate a methodology to design the
deepest pipelined circuits in FPGAs, starting from VHDL.
FPGA devices have some specific characteristics that
allow the designer to implement a "gate level" pipeline
with optimal performance, the only remark being that the
word gate here means any 4-input function with a single
output. Longer stages will present twice the delay of logic
elements and will use an outside connection. Shorter
stages do not take advantage of the fact that the FPGA cell
can implement any function with the same delay. The idea
already appears in an Altera Application Brief [3], but we
did not find descriptions and results of a methodology or
implementation anywhere else.
Despite the fact that FPGA architectures differ from
vendor to vendor, they still present a set of basic common
features that allow building gate level pipelines in VHDL.
By doing so, it is possible to reuse and map the design to
many different devices, and reuse it at the press of a
button, in contrast to the full custom approach
As a case study, we developed an architecture for
integer multiplication that exploits the deepest pipelines
and then we build a floating point multiplication unit that
is able to perform 235 MFLOPS in an Altera Apex20KE
device. The integer architecture is parameterized to any
number of bits, what increases its applicability. Yet the
floating-point unit is restricted to only single precision, 32-
bit, as presented in this paper, but can be easily extended
to larger widths.
The rest of the paper is organized as follows. Section 2
explains how the technique is employed, and presents
some information about the Altera APEX FPGA
architecture. Section 3 describes the design of a
parameterized array integer multiplier, along with its
performance, which is compared to the default multipliers
offered in the Altera library. In section 4 a complete IEEE
754 compliant floating pointer multiplier is presented,
which not only achieves higher absolute frequency, but
also better performance/area tradeoff. Finally, section 5
presents some discussion and our concluding remarks.
2. Deep pipelines in FPGA
An FPGA device is generally an array of configurable
basic blocks called logic cells (LCs) or logic elements
(LE)s. The second term is preferred and used throughout
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers’ Forum (DATE’04)
1530-1591/04 $20.00 © 2004 IEEE
this paper to avoid misinterpretation. Abstracting
implementation details, one can think of an LE as being
composed of a look-up table (LUT), a register cell (latch,
flip-flop), and a multiplexer, as it is shown in Fig. 1. The
LUT can implement any truth table up to a given number
of inputs, 4 in this case, although there are other simple
gates outside the LUT that let the LE implement some
functions with a larger number of inputs. The multiplexer
selects the output from the LUT or the register as the
output of the LE.
The pipeline depth of a circuit implemented in an FPGA
can be pushed to the level of using the register cell in
every single LE that is necessary. We may still try to put
as much logic as we can inside a single LE, basically in the
LUT part. But every time a combinational circuit does not
fit into one LE, an additional stage is introduced. This will
guarantee that there is no path longer than a single LE
between any two storage elements, and is the shortest
possible path between them. Therefore, our main goal is to
design circuits that have this property, and to check out the
performance limits that can be achieved in these devices.
LUTLUT
Figure 1. A simplified FPGA Logic Element
In many cases, a synthesis process with a VHDL
description as input can produce gate net-lists that are not
exactly what the designer wanted. But to implement
pipelines at the level of logic elements, this situation must
not occur, and the designer must have full control of the
resulting implementation. Hopefully, using the register
cells is easier that one might think. It is sufficient to
include a clause that depends on the clock signal to make
the output assignment of small partial functions, just as it
is done in normal situations to describe a memory element.
These partial functions will be contained in basic entities
that may be instantiated to build the circuit under
consideration.
Fig. 2 shows an example of a half adder description
where the outputs are registered, and can be used as a
stage of the deep pipeline. The synthesis generates two
logic elements for this entity, one for the sum output S,
and the other for the carry output COUT. Such an entity
can be instantiated anywhere in a bigger design, and the
result of its synthesis will still be the same pair of LEs.
This explains how to describe basic functions in which
the design must be decomposed. The architecture of the
circuit must also be adapted to allow this decomposition,
since not all possible logic functions fit into a single LE.
The rule when defining basic entities is that for every
output, there should be at most four inputs that may affect
its result, because LUTs have 4 inputs.
12
13
14
15
16
17
18
19
20
21
22
Architecture arch_1 of mcell1 is
signal soma,carry: std_logic;
begin
soma <= PP xor CIN;
carry <= PP and CIN;
process begin
wait until CLK = '1';
S <= soma;
COUT <= carry;
end process;
end arch_1;
Figure 2. VHDL of a basic block
There is another point that must be observed and
significantly affects the design of the circuit. All the paths
from the inputs to the outputs must pass through the same
number of LEs. This is necessary to synchronize data in
the pipeline. Whenever a path from an input to an output is
shorter than the largest one, additional delay elements
must be inserted to make the data flow at the same (logic)
speed. These delay cells can be declared just as the other
entities were. As it might be expected, the insertion of LEs
whose sole purpose is to produce delays greatly affects
circuit area, increasing device usage, and possibly limiting
the application of such approach.
In our implementations, basic entities are grouped
together using structural VHDL to form bigger building
blocks. This hierarchy allows us to adequate the circuit
structure to the FPGA architecture and to perform some
placement optimizations that are described later on.
In the Apex20K family of devices, each set of 10 LEs is
grouped in a structure called Logic Array Block (LAB),
which in turn is grouped in sets of 10 or 16 to form the
MegaLab structure. Communication between LEs in the
same LAB is extremely fast, with minimal delays caused
by interconnects. Connections between LEs in different
LABs inside the same MegaLab have bigger delays, but
are still very fast. Delays between LEs start to become
critical in the interconnections when they are placed in
separate MegaLabs. There is still, however, a set of "fast
interconnects" between neighbor MegaLabs that can be
used to keep the signals with minimal delays. But they are
limited in the sense that are restricted only to neighbor
MegaLabs and also because there are only a few of these
fast lines. Given that a MegaLab is not square, the system
will run out of horizontal connections first.
3. Design of an Integer Multiplier
In order to test the gate level pipeline technique just
explained, an integer array multiplier was first designed.
The fastest types of multipliers are the parallel ones, as
Wallace [5] and Dadda [6] architectures. However, these
architectures do not have the same regular structure
already present in the Array multiplier [7]. As Fig. 3a
shows, each line of logic cells computes basically one
partial product, and could be a separate pipeline stage. A
comparison among this approach and the parallel
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers’ Forum (DATE’04)
1530-1591/04 $20.00 © 2004 IEEE
architectures for an ASIC are found in [8]. We chose the
Array architecture as the starting point. By inspection one
can see that delays are also propagated horizontally.
Therefore, if partial products were used as pipeline stages,
each stage would have to wait for the propagation of carry
signals and will then have the delay of many LEs.
(A) (B)carrycarrycarry
carrycarry
carrycarrycarry
carrycarry
Figure 3. Classic and adapted carry propagation
So, in order to implement the gate level pipeline, there
were two options. The first one was to consider the circuit
as running diagonally from top-right to bottom-left, and
the other was to adapt the carry propagation to be taken
into account only at the next stage, as Fig. 3b shows. We
chose the second approach, as we expected it to minimize
the amount of additional delay elements to be inserted.
3.1 Multiplier architecture
Fig. 4 shows an example of a 4-bit integer multiplier
where it is possible to observe the elaborated architecture.
The next two sections explain the basic blocks and the
intermediate level structure, respectively.
mrow1
mrow_middle
mrow_pre_last
mrow_last
A3 A3 A2 A2 A1 A1 A0 A1 A0
B0
B1
B2
B3
S7 S6 S5 S4 S3 S2 S1 S0
Figure 4. 4-bit array multiplier
3.2 Basic building blocks
Six types of basic blocks are needed to implement the
integer multiplier (see Fig. 5), and they are:
a) A half adder that adds results and carry, called mcell1;
b) A multiplication block that computes the sum of two
single bit multiplications, called mcell2a;
c) A multiplication block that computes a single bit
multiplication and runs the result into a full adder,
called mcell2;
d) A block for propagation only, called mcell3;
e) A half adder without carry, called mcell4;
f) A double delay block, to propagate input B and
results, called mcell5;
P
PP
PP
(D)
S
PP Cin
S
(E)
P
B BB
PP
PP
(F)
P
PP
PP
(D)
P
PP
PP
P
PP
PP
(D)
S
PP Cin
S
(E)
S
PP Cin
S
S
PP Cin
S
(E)
P
B BB
PP
PP
(F)
P
B BB
PP
PP
P
B BB
PP
PP
(F)
CS
Cout S
PP Cin
(A)
C
A
S
Cin
Cout
B
PP
AS
A
(B)
B1C
A
S
Cout
A1
AS
A0
B0
(C)
CS
Cout S
PP Cin
(A)
CS
Cout S
PP Cin
CS
Cout S
PP Cin
(A)
C
A
S
Cin
Cout
B
PP
AS
A
(B)
C
A
S
Cin
Cout
B
PP
AS
A
C
A
S
Cin
Cout
B
PP
AS
A
(B)
B1C
A
S
Cout
A1
AS
A0
B0
(C)
C
A
S
Cout
A1
AS
A0
B0
C
A
S
Cout
A1
AS
A0
B0
(C)
Figure 5. Basic building blocks
3.3 Intermediate level structure
Instances of the basic blocks are used to assemble
intermediate level, regular blocks, that perform specific
tasks in the multiplier (see Fig. 4). The four intermediate
level structures are:
mrow1: Corresponds to the first stage of the pipeline
and is composed of n mcell2a cells. Two bit
multiplications are possible at this stage since it does not
have a previous one, and bit multiplications are
implemented with AND gates. The output of this structure
is a vector of n+1 bits for results and a vector of n-1 bits
for carry out.
mrow_middle: Corresponds to the next n-2 stages.
Each stage uses n blocks of mcell2 and some blocks of
mcell5 for propagation of previous results.
mrow_pre_last: The propagation of inputs A and B are
no longer needed. So, the nth stage uses n-1 instances of
mcell1 for carry adjust and n instances of mcell3 for
propagating previous result.
mrow_last: Implements the n-1 last stages of the
pipeline, performing only carry adjust in the most
significant bits. Each line uses one instance of mcell4,
except the first, who uses a mcell3 instead, and instances
of mcell1 and mcell3, starting with n-2 instances of mcell1
and n+1 instances of mcell3. In the following lines, the
number of mcell1 instances is decreased by 1, and the
number of mcell3 instances is increased by 1.
There are also adjacent structures for mrow1 and
mrow_middle described in the top entity to compute the
least significant bit of the first stages. The first one uses a
simple AND gate, and the next ones synchronize B inputs
and propagate the result of the least significant bit.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers’ Forum (DATE’04)
1530-1591/04 $20.00 © 2004 IEEE
3.4 Latency and Logic Elements prediction
We described the architecture of the multiplier in a
parameterized way, so that it can be instantiated for any
required number of bits. Since the implementation is
highly regular, both latency and circuit size can be
predicted. The latency will always be (2*n-1) cycles.
The number of LEs used in each basic block is well
known, and corresponds to the number of small squares
shown in Fig. 5. So, it is possible to estimate the total size
of the resulting circuit in number of LEs by the following
equation, for a multiplier of n bits:
�
−
=
⋅+−⋅−⋅=
2
1
2 3235# n
i
innLEs
Table 1 presents the latency and circuit size prediction
for commonly used data widths. The 24 and 54 bit widths
are used in the floating point IEEE 754 standard, which
will be discussed in section 4.
Table 1. Latency and size prediction.
#bits Latency #LEs
4 7 75
8 15 357
16 31 1545
24 47 3565
32 63 6417
54 107 18550
64 127 25915
3.5 Implementation and Simulation Results
We tested the resulting performance of the integer
multiplier synthesized for a range of bit widths from 4 to
64. Table 2 shows in the second column the operating
frequency achieved in each circuit. The first thing to note
is that the performance obtained is very high for this kind
of device. In fact, we investigated why the 4 and 8 bit
circuits presented the same performance, and found out
that there is a limitation in the device due to the clock
distribution. This leads us to two conclusions. The first is
that these versions of 4 and 8 bits could possibly operate in
higher frequencies provided that their paths are not the
limiting factor. And the second is that our results are very
close to the limiting aspects of the FPGA technology, and
will be hardly outperformed. But the performance still
drops a little as the bit widths become wider. This is due to
the delay of interconnects, which increases with the
increase in circuit size. The third column shows the
operating frequency of multipliers generated by the
Altera´s MegaWizard tool. We can see in the next column
the improvement that we get over this standard solution
(up to 3 times). Of course, it is very important to note that
our latency is really giant compared to other solutions, and
therefore the technique should not be used as a common
solution. However, in many applications, we do have the
scenario where uninterrupted multiplication sequences are
needed, such as in filters, other DSP operations, in such a
way that a long latency will be accepted.
Table 2. Operating frequencies.
#bits UFRGS Altera ratio
4 290 279 1,04
8 290 200 1,45
16 266 137 1,94
24 217 120 1,80
32 218 107 2,04
54 165 43,3 3,84
64 141 50,7 2,78
However, the main drawback of this approach is the
circuit size. Table 3 compares the device usage of our
solution against the one resulting from the Altera
multipliers. It is possible to see that our penalty in area is
bigger that our gain in performance for most of the
circuits. Although deep pipeline still wins in the 54 bit
operations, circuit size may limit severely the application
of such technique. It might be better to replicate standard
components, which not only operate in parallel but also
have lower latency.
Table 3. Comparing circuit sizes.
#bits UFRGS Altera ratio
4 75 34 2,21
8 357 160 2,23
16 1545 560 2,76
24 3565 1256 2,84
32 6417 2160 2,97
54 18550 6188 3,00
64 25915 8610 3,01
3.6 Placement optimizations
Although the simulation reveals very high clock rates,
they may still be enhanced if we can reduce the longer
interconnect paths that start to appear in larger circuits.
One option is to develop a specific placement method for
this architecture. But from the designer´s perspective, we
are left with the option of manually placing the
synthesized cells.
In order to investigate its impact on performance, we
manually adjusted the placement of a 32-bit multiplier
using the Logic Lock tool available in Altera Quartus II,
version 2.2. With this tool it is possible to specify regions
where VHDL entities must be placed.
For best performance, blocks with strong interaction
shall be close, preferably in the same MegaLab. We
grouped the design in each pipeline stage.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers’ Forum (DATE’04)
1530-1591/04 $20.00 © 2004 IEEE
These groups must be placed according to the data
flow, starting from mrow1, following the mrow_middle
instances, then the mrow_pre_last and finally to the
mrow_last instances. Fig. 6 shows an example of such a
placement for the first rows. These first groups include
also the path traversed by the least significant bits.
Figure 6. Region assignment with Logic Lock.
We have manually placed the blocks at the same
column of MegaLabs in the FPGA, just changing to a
neighbor column when the first one is filled up. This is
better that placing them line by line, since there are more
fast lines in the vertical direction, as it was described
before. In fact, one of the most sensible points was at the
turn point where we change the column. Whenever this
turn is placed, the performance drops strongly because we
run out of horizontal fast lines and the connections start to
use slower ones.
The solution to this problem was to slightly alter the
original placement of each group in a MegaLab, from the
case depicted in Fig. 7a to the one shown in Fig. 7b. In this
cas