A Low-Power Vector Processor Using Logarithmic
Arithmetic for Handheld 3D Graphics Systems
Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST)
373-1, Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea
Abstract- A low-power, high-performance 4-way 32-bit vector
processor is developed for handheld 3D graphics systems. It
contains a floating-point unified matrix, vector, and elementary
function unit. By utilizing logarithmic arithmetic, the unit
achieves single-cycle throughput for all these operations except
for the matrix-vector multiplication, which has 2-cycle throughput.
The processor, featuring this function unit, a cascaded integer-float
datapath, datapath reconfiguration, operand forwarding in the
logarithmic domain, and a vertex cache, takes 9.7mm2 in 0.18µm
CMOS technology and achieves 141Mvertices/s for geometry
transformation and 12.1Mvertices/s for OpenGL transformation
and lighting at 200MHz with 86.8mW power consumption.
I. INTRODUCTION
Handheld graphics processing units (GPUs) incorporate
vector processors, known as shaders, into their 3D graphics
pipeline stages to provide more realistic images [1]. In [2], a
vertex shader with 16-way floating-point (FLP) multipliers
was proposed for fast geometry transformation, but it
consumed a large silicon area and power. For a power- and
area-efficient shader design, a multifunction unit was
proposed in [3]. However, it was a fixed-point unit and did not
handle matrix-vector multiplication, which is frequently used
for 3D geometry transformations.
In this paper, a 4-way 32-bit FLP vector processor is
proposed for the shaders. It adopts a unified matrix, vector,
and elementary function unit, which unifies all these
operations in a single 4-way arithmetic unit. The unit operates
on FLP data since the newly defined graphics API requires
more than 24-bit FLP precision [1]. Although it operates on
FLP data, it uses logarithmic arithmetic internally to reduce
the arithmetic complexity. Its instruction set includes
matrix-vector multiplication (MAT); vector operations (VEC)
such as vector multiplication (MUL), division (DIV),
divide-by-square-root (DSQ), multiply-add (MAD), and
dot product (DOT); and elementary functions (ELM)
including trigonometric functions (TRGs), power (POW), and
logarithm (LOG). It achieves single-cycle throughput with
at most 5-cycle latency for all these operations, except for
the MAT, which has 2-cycle throughput and 6-cycle latency. The
processor has a cascaded structure of integer and FLP
datapaths for efficient indexing of FLP operands.
Reconfiguration of the FLP multifunction unit is supported to
extend the instruction set according to the user's requirements.
Its pipeline exploits operand forwarding in the logarithmic
domain to improve pipeline throughput and computation
accuracy. A vertex cache is adopted to reuse previously
processed results and improve the throughput of the processor.
II. ARITHMETIC UNIT
A. Number System
The proposed arithmetic unit is based on the hybrid
approach of FLP and the logarithmic number system (LNS)
introduced in [4], in which operations are reduced to simpler
ones in the LNS, while addition and subtraction are performed
in FLP, since LNS addition and subtraction require nonlinear
term evaluations. Logarithmic and antilogarithmic converters
between FLP and the LNS are proposed for this hybrid
number system (HNS).
Fig. 1. Proposed number converters: (a) the logarithmic converter, built from a 15-entry LUT (64B), shifters computing m>>ci, m>>di, and m>>ei, a CSA tree, and a CPA producing ai·m + bi from the input exponent e and mantissa m; (b) the antilogarithmic converter, built from an 8-entry LUT (56B), shifters computing f>>ci and f>>di, CSAs, and a CPA producing ai·f + bi from the integer part e and fraction f.
1) Logarithmic Converter
For the 32-bit FLP input $x = 2^e(1+m)$, its logarithmic
number is represented as $\log_2 x = e + \log_2(1+m)$. The
exponent $e$ is the integer part of the logarithmic number and
$\log_2(1+m)$ is the fractional part. The term $\log_2(1+m)$ is
approximated by piecewise linear expressions as
$\log_2(1+m) \approx a_i m + b_i$, where $a_i$ and $b_i$ are the
approximation coefficients defined for each approximation region.
The logarithmic converter divides $(1+m)$ into finer
subdivisions in the region near 1, since the error
increases as the input value gets closer to 1. It achieves a
maximum 0.41% conversion error with 15 approximation
regions. Fig. 1(a) shows the proposed logarithmic converter.
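As a rough software model, the piecewise-linear conversion can be sketched as follows. The paper does not list its 15 non-uniform regions or LUT coefficients, so this sketch assumes 16 evenly spaced regions with secant-line fits; it illustrates the $e + a_i m + b_i$ structure, not the exact hardware.

```python
import math

REGIONS = 16  # illustrative even split; the paper uses 15 non-uniform regions

def make_lut(regions=REGIONS):
    """Build (a_i, b_i) pairs per region by fitting a secant line to log2(1+m)."""
    lut = []
    for i in range(regions):
        m0, m1 = i / regions, (i + 1) / regions
        y0, y1 = math.log2(1 + m0), math.log2(1 + m1)
        a = (y1 - y0) / (m1 - m0)   # slope coefficient a_i
        b = y0 - a * m0             # intercept coefficient b_i
        lut.append((a, b))
    return lut

LUT = make_lut()

def log2_approx(x):
    """Approximate log2(x) for x > 0 as integer part e plus a_i*m + b_i."""
    e = math.floor(math.log2(x))
    m = x / 2.0 ** e - 1            # mantissa fraction in [0, 1)
    a, b = LUT[min(int(m * REGIONS), REGIONS - 1)]
    return e + a * m + b
```

The antilogarithmic converter is symmetric: it splits the fraction $f$ evenly and evaluates $a_i f + b_i \approx 2^f$ from its own LUT.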
2) Antilogarithmic Converter
For the logarithmic number $X = e + f$, its FLP number can be
represented by $2^X = 2^{e+f} = 2^e \cdot 2^f$. The integer part $e$
directly becomes the exponent of the FLP number, and the nonlinear
term $2^f$ is approximated by piecewise linear expressions as
$2^f \approx a_i f + b_i$, where $a_i$ and $b_i$ are the approximation
coefficients defined for each approximation region.
The antilogarithmic converter divides $f$ evenly into the
approximation regions, since the antilogarithmic conversion
error is spread evenly over the entire input range. It achieves a
maximum 0.08% conversion error with 8 approximation
regions. Fig. 1(b) shows the antilogarithmic converter.
1-4244-1125-4/07/$25.00 ©2007 IEEE.
B. Organization
The arithmetic unit is organized with 4 channels and 5
pipeline stages, as shown in Fig. 2. The 4 input 32-bit FLP
operands are converted into logarithmic numbers with an 8-bit
integer, 24-bit fraction, 1-bit sign, and 1-bit zero flag through
the 4 logarithmic converters (LOGCs) in E1. E2 includes
the programmable multiplier (PMUL) shown in Fig. 3,
which can serve as a Booth multiplier for ELM, as 4
LOGCs for VEC, or as 4 antilogarithmic converters (ALOGCs)
for MAT by merely adding a 15-entry 64B LOG lookup table
and an 8-entry 56B ALOG lookup table to the Booth multiplier
and sharing the CSA tree and a CPA. In this way, the number
of LOGCs in E1 is reduced to 4, compared with 8 in [3]. In E3,
4 adders in the logarithmic domain are provided for the VECs,
and the resulting values are converted into FLP numbers through
the 4 ALOGCs. The programmable adder (PADD) in E4
can be programmed into a 5-input FLP adder tree or 4-way
2-input SIMD FLP adders for the target operations, as proposed
in [3]. E5 provides a SIMD FLP accumulator for the final
accumulation required by the MAT; it also serves as
rounding logic for the other operations.
Fig. 2. Proposed arithmetic unit: four 32-bit channels, each comprising a LOGC (E1), the programmable multiplier with embedded LOG/ALOG lookup tables (E2), a logarithmic-domain CPA with shifter and ALOGC (E3), the programmable adder (E4), and an FLP accumulator (E5), with MAT, POW, and TRG control paths selecting the configuration per operation.
1) Matrix-Vector Multiplication
The geometry transformation in 3D graphics can be
computed as the multiplication of a 4×4 matrix with a 4-element
vector, which requires 16 multiplications and 12 additions.
This can be converted into the HNS as in Fig. 4, requiring 20
LOGCs, 16 adders, 16 ALOGCs, and 12 FLP adders. Since
the coefficients of a geometry transformation matrix are fixed
during the processing of a 3D object, they can be pre-converted
into the logarithmic domain and used as constants during the
processing. Thus, the MAT only requires 4 LOGCs for vector
element conversion, 16 adders in the logarithmic domain, 16
ALOGCs, and 12 FLP adders. This can be implemented in 2
phases on this 4-way arithmetic unit, as illustrated in Fig. 4. In
this scheme, 8 adders in the logarithmic domain and 8 ALOGCs
are required per phase; the 8 ALOGCs of the first phase
are obtained by programming the PMUL in E2 into 4 ALOGCs,
together with the 4 ALOGCs in E3. The CPAs in E1 and E3
serve as the 8 adders in the logarithmic domain. The 4
multiplication results from the ALOGCs in E2 and the other 4
from E3 are added in E4 by programming the PADD into a
4-way SIMD FLP adder to obtain the first-phase result. The
same process is repeated, and the accumulation with the
first-phase result in E5 completes the MAT. Thus, the MAT is
implemented with 2-cycle throughput on this 4-way arithmetic
unit, whereas it required 4-cycle throughput in the conventional
approach [3][5].
Fig. 3. Programmable multiplier (PMUL)
$$
\begin{bmatrix} x_0' \\ x_1' \\ x_2' \\ x_3' \end{bmatrix}
=
\begin{bmatrix}
c_{00} & c_{01} & c_{02} & c_{03} \\
c_{10} & c_{11} & c_{12} & c_{13} \\
c_{20} & c_{21} & c_{22} & c_{23} \\
c_{30} & c_{31} & c_{32} & c_{33}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
=
\begin{bmatrix}
2^{\log_2 c_{00} + \log_2 x_0} + 2^{\log_2 c_{01} + \log_2 x_1} + 2^{\log_2 c_{02} + \log_2 x_2} + 2^{\log_2 c_{03} + \log_2 x_3} \\
2^{\log_2 c_{10} + \log_2 x_0} + 2^{\log_2 c_{11} + \log_2 x_1} + 2^{\log_2 c_{12} + \log_2 x_2} + 2^{\log_2 c_{13} + \log_2 x_3} \\
2^{\log_2 c_{20} + \log_2 x_0} + 2^{\log_2 c_{21} + \log_2 x_1} + 2^{\log_2 c_{22} + \log_2 x_2} + 2^{\log_2 c_{23} + \log_2 x_3} \\
2^{\log_2 c_{30} + \log_2 x_0} + 2^{\log_2 c_{31} + \log_2 x_1} + 2^{\log_2 c_{32} + \log_2 x_2} + 2^{\log_2 c_{33} + \log_2 x_3}
\end{bmatrix}
$$
Fig. 4. Two-phase implementation of MAT
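The two-phase HNS scheme above can be modeled in software as follows. This is a numerical sketch only: it assumes positive operands (signs and zero flags are handled by separate bits in the hardware) and exact converters, and `mat_vec_hns` is an illustrative name, not the paper's ISA.

```python
import math

def mat_vec_hns(C, x):
    """Model of the 2-phase MAT: each multiplication becomes an addition in
    the logarithmic domain; only the row sums stay in FLP (E4/E5)."""
    logC = [[math.log2(c) for c in row] for row in C]  # pre-converted constants
    logx = [math.log2(v) for v in x]                   # 4 LOGCs in E1
    result = []
    for row in logC:
        # phase 1 and phase 2 each use 8 log-domain adds and 8 ALOGCs
        phase1 = sum(2.0 ** (row[j] + logx[j]) for j in (0, 1))
        phase2 = sum(2.0 ** (row[j] + logx[j]) for j in (2, 3))
        result.append(phase1 + phase2)                 # final accumulation (E5)
    return result
```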
2) Vector and Elementary Functions
The VECs and ELMs, including the MUL, DIV, DSQ,
MAD, DOT, POW, LOG, and TRGs, are implemented based
on the scheme in [3]. Since the vector operations require 2
LOGCs for the 2 operands per channel, the PMUL is programmed
into 4 LOGCs to make 8 LOGCs for the 4 channels, together
with the 4 LOGCs in E1.
Since the power operation is converted into a multiplication in
the logarithmic domain, it requires a 32b×24b multiplier, which is
implemented by programming the PMUL into a single
32b×24b BMUL. The TRGs are unified with the others using the
Taylor series, as proposed in [3]. This power series requires a
4-way 32b×6b multiplier in the logarithmic domain and a final
summation of the terms. This is implemented by
programming the PMUL into a 4-way 32b×6b BMUL and the
PADD into a 5-input summation tree.
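The POW reduction above — the exponentiation becoming a single multiplication in the logarithmic domain — can be shown with a minimal sketch. It assumes a positive base and models the converters as exact for brevity:

```python
import math

def pow_hns(x, y):
    """POW in the log domain: x**y = 2**(y * log2(x)), i.e. one
    multiplication between the log and antilog conversions (x > 0)."""
    return 2.0 ** (y * math.log2(x))
```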
III. PROCESSOR ARCHITECTURE
A. Instruction Set Architecture
The processor has two instruction formats, 32-bit and 64-bit.
The 32-bit instructions are used for control and integer (INT)
operations, while the 64-bit instructions are used for FLP
operations. This separation of formats, rather than a single
64-bit format, reduces the instruction memory (IMEM) size
and power dissipation.
In a loop, the register indices of the FLP operands
may be calculated dynamically as a function of the loop
counter i as follows:
OPR Dst[i+kd],SrcA[i+ka],SrcB[i+kb],SrcC[i+kc]
Thus, the INT instructions calculating the indices can be
embedded in the FLP instruction as its operand fields for
efficient index calculation. In this case, the INT computation
is followed by the FLP computation within a single 64-bit
instruction, as shown in Fig. 5.
Fig. 5. 64-bit FLP instruction format
B. Micro-architecture
Fig. 6 shows the micro-architecture of the proposed
processor. It has 4-way 32-bit FLP vector register files,
including a 512-byte 32-entry vertex input register (VIR), a
general-purpose register file (GPR), a 4KB 256-entry constant
memory (CMEM), and a 256-byte 16-entry vertex output
register (VOR). The FLP operands are fetched from the GPR,
VIR, or CMEM, and the result is written back to the GPR or
VOR, indexing the target entry in the destination register file.
The FLP operands can be swizzled, negated, and converted
into absolute values by the source modifiers. The arithmetic
unit proposed in Section II is used as the FLP multifunction
unit in this processor. The vertex cache is composed of 16
VORs.
1) Cascaded INT-FLP Datapaths
This processor has a cascaded architecture of 4-way 8-bit
INT and 4-way 32-bit FLP datapaths to implement the
embedded index calculation of the FLP operands without
using additional cycles. For flexible index calculation, the
processor includes a 4-way 8-bit SIMD integer ALU and a
64-byte 16-entry integer register file (IGPR). The 4 8-bit
results from this unit index the 3 source operands and 1
destination of an FLP operation.
When a multiplication is required for the index calculation,
the PMUL in the FLP arithmetic unit is programmed into a
32b×24b integer multiply-add (IMAD) unit, since conversion
error is not allowed in index calculation.
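The cascaded INT-then-FLP execution of a single 64-bit instruction can be sketched as below. The function name and the MAD choice are illustrative; the paper defines only the operand-field form Dst[i+kd], SrcA[i+ka], SrcB[i+kb], SrcC[i+kc].

```python
def exec_mad_indexed(i, k, gpr):
    """Model of one 64-bit instruction: the embedded INT stage forms the
    register indices i + k* for 3 sources and the destination, then the
    FLP stage runs a 4-way MAD on the indexed 4-element vectors."""
    ka, kb, kc, kd = k                      # index offsets from the instruction
    a, b, c = gpr[i + ka], gpr[i + kb], gpr[i + kc]
    gpr[i + kd] = [x * y + z for x, y, z in zip(a, b, c)]  # 4-way MAD
    return gpr
```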
2) Datapath Reconfiguration
The FLP unit of Section II contains several MUXes to unify
the various operations, and it exposes all of the control points
to the programmer, who can compose arbitrary operations on
it by programming the 91-bit control signals. For example,
the various TRGs [3] supported in this processor can be
programmed by a single configuration instruction (CFG) with
configuration data, rather than including all of the TRGs
in the limited instruction space. The programmed
configuration data are stored in the 64-byte 4-entry
configuration register file (CFR) and accessed by the CFG.
Thus, the configuration can be changed every cycle.
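The CFR mechanism can be sketched as a small register file of 91-bit control words; the 2-bit selector field is taken from Fig. 6, and the class and method names here are illustrative assumptions, not the paper's terminology.

```python
CFR_ENTRIES, CTRL_BITS = 4, 91  # 64-byte 4-entry CFR of 91-bit control words

class Cfr:
    """Toy model of the configuration register file driving the FLP unit."""
    def __init__(self):
        self.regs = [0] * CFR_ENTRIES

    def write(self, entry, ctrl_word):
        # a CFG instruction stores one 91-bit control word into an entry
        assert 0 <= entry < CFR_ENTRIES and 0 <= ctrl_word < (1 << CTRL_BITS)
        self.regs[entry] = ctrl_word

    def select(self, cfg_field):
        # each FLP instruction's 2-bit field picks a configuration per cycle
        return self.regs[cfg_field & 0b11]
```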
Fig. 6. Micro-architecture of the proposed processor: fetch/decode from the 2KB IMEM; a 4-way 8-bit integer ALU with the 64B IGPR producing operand indices; a 128-bit FLP datapath with SWZ/NEG/ABS source modifiers over the 512B GPR, 512B VIR, 4KB CMEM, and 2KB DMEM; the FLP multifunction unit under 91-bit control from the 64B CFR, selected by the 2-bit CFG field; and the vertex cache of 16 VORs (4KB).
3) Logarithmic-domain Forwarding
Operand forwarding improves the throughput of the
processor pipeline. In this processor, operand forwarding is
also supported in the logarithmic domain. As shown in Fig. 7,
for consecutive FLP operations that do not require the final
FLP adders, the antilogarithmic and logarithmic conversions
cancel each other, and the intermediate logarithmic value of
the previous FLP operation is forwarded directly into the
logarithmic domain of the next FLP operation, bypassing the
antilogarithmic and logarithmic converters of the two
operations. This reduces both the pipeline latency and the
computation error, since the repeated antilogarithmic and
logarithmic conversions are the sources of error.
Fig. 7. Logarithmic-domain forwarding. With conventional forwarding, op1's result leaves the log domain through its Alog stage (E3) and op2 re-enters it through its Log stage (E1); with logarithmic-domain forwarding, these paired Alog/Log conversions are cancelled and op2 consumes op1's E2 log-domain value directly, reducing cycles.
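The accuracy benefit can be demonstrated numerically. The sketch below uses a toy 16-region secant-fit logarithmic converter (an assumption; the paper's converter has 15 non-uniform regions) and models the antilog conversion as exact for brevity, comparing (a·b)·c computed with and without leaving the log domain in between:

```python
import math

REGIONS = 16  # toy piecewise-linear converter

def lossy_log2(x):
    """Toy logarithmic converter with secant-fit conversion error."""
    e = math.floor(math.log2(x))
    m = x / 2.0 ** e - 1
    i = min(int(m * REGIONS), REGIONS - 1)
    m0, m1 = i / REGIONS, (i + 1) / REGIONS
    a = (math.log2(1 + m1) - math.log2(1 + m0)) * REGIONS
    return e + math.log2(1 + m0) + a * (m - m0)

a, b, c = 3.7, 1.9, 2.6
exact = a * b * c
# conventional forwarding: leave the log domain between the two multiplies,
# paying an extra lossy logarithmic conversion on the intermediate value
t = 2.0 ** (lossy_log2(a) + lossy_log2(b))
conventional = 2.0 ** (lossy_log2(t) + lossy_log2(c))
# logarithmic-domain forwarding: keep the intermediate in the log domain
forwarded = 2.0 ** (lossy_log2(a) + lossy_log2(b) + lossy_log2(c))
```

Because every secant fit of the concave log2(1+m) underestimates, the extra conversion in the conventional path can only add error of the same sign, so the forwarded result is at least as accurate.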
4) Vertex Cache
A transformation and lighting (TnL) vertex cache is provided
to reuse previously processed vertices without re-executing
the TnL routine. The 4KB SRAM contains 16 result
vertices with a 58% hit rate. This leads to single-cycle TnL for
the vertices found in the vertex cache and, together with the
2-cycle MAT, a peak geometry transformation (TFM) rate of
141Mvertices/s.
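The cache's role can be sketched as a lookup keyed by vertex index. The replacement policy and interface are assumptions for illustration; the paper only specifies the size (16 entries, 4KB) and the 58% hit rate.

```python
class VertexCache:
    """Toy 16-entry TnL vertex cache: a hit returns the stored result
    (single-cycle reuse); a miss runs the full TnL routine."""
    def __init__(self, entries=16):
        self.entries = entries
        self.store = {}  # vertex index -> transformed-and-lit result

    def lookup(self, idx, tnl):
        if idx in self.store:
            return self.store[idx], True           # hit
        result = tnl(idx)                          # miss: execute TnL
        if len(self.store) >= self.entries:
            self.store.pop(next(iter(self.store))) # FIFO-like eviction (assumed)
        self.store[idx] = result
        return result, False
```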
IV. IMPLEMENTATION RESULTS
A. Chip Implementation
The proposed processor is integrated into a 3D graphics
chip as a vertex shader and fabricated in 0.18µm 6-metal
CMOS technology [6]. The chip micrograph and a summary
of the chip characteristics are shown in Fig. 8. The core
occupies 9.7mm2 and operates at 200MHz, consuming 86.8mW
at 1.8V. Fig. 9 shows the shmoo plot. The clock to the pipeline
registers is fully gated to disable unnecessary switching under
the control of each instruction. All the FLP instructions are
processed with single-cycle throughput, except for the MAT
with 2-cycle throughput. Table I shows the latencies and
throughputs of the FLP instructions.
Fig. 8. Chip micrograph
Fig. 9. Shmoo plot (operating frequency in MHz)
TABLE I
THE LATENCY/THROUGHPUT OF SELECTED INSTRUCTIONS

|            | ADD | DIV | DSQ | MAD | DOT | MAT | POW | SIN |
|------------|-----|-----|-----|-----|-----|-----|-----|-----|
| Latency    | 1   | 3   | 3   | 5   | 5   | 6   | 3   | 5   |
| Throughput | 1   | 1   | 1   | 1   | 1   | 2   | 1   | 1   |
B. Comparison
The performance is compared on the full OpenGL TnL
routine [7], with model-view, normal, and perspective
transformations; normalizations of the light, view, normal, and
Blinn half vectors; and intensity calculations of the diffuse and
specular lighting for a single light source. The routine
includes 3 MATs, 4 DSQs, 2 DIVs, 6 DOTs, 2 MADs, 1 POW,
and 2 ADDs. After rescheduling the code to avoid
dependencies, the routine executes in 38 cycles on the
proposed processor. Table II shows the comparison results;
the peak TFM performance is also compared. Our work shows
17.5% and 22.2% performance improvements for TFM and
TnL, respectively, while reducing power and area by 44.7%
and 39.4%, respectively, compared with the latest work [2].
TABLE II
THE COMPARISON RESULTS

| Ref       | TFM (Mvertices/s) | TnL (Mvertices/s) | Power (mW) | Area (mm2) | Freq. (MHz) | Process (µm) |
|-----------|-------------------|-------------------|------------|------------|-------------|--------------|
| [8]       | 36                | 3.85              | 250        | N.A.       | 400         | 0.13         |
| [5]       | 50                | 3.6               | 75.4       | 10.2       | 200         | 0.18         |
| [9]       | 33                | 7.55              | N.A.       | N.A.       | 166         | 0.13         |
| [2]       | 120               | 9.9               | 157        | 16         | 100         | 0.18         |
| This work | 141               | 12.1              | 86.8       | 9.7        | 200         | 0.18         |
V. CONCLUSION
A high-performance, power- and area-efficient vector
processor is proposed for 3D graphics shaders. It adopts a
4-way 32-bit FLP multifunction unit. Using logarithmic
arithmetic, the unit unifies the vector, matrix, and elementary
functions in a single arithmetic unit and achieves single-cycle
throughput for all operations, except for the MAT with 2-cycle
throughput. With the help of this multifunction unit, together
with the cascaded integer-float datapath, datapath
reconfiguration, logarithmic-domain forwarding, and the
58%-hit-rate vertex cache, the processor achieves
141Mvertices/s for TFM and 12.1Mvertices/s for TnL at
200MHz. Compared with previous work, it shows 17.5% and
22.2% performance improvements for TFM and TnL,
respectively, while reducing power and area by 44.7% and
39.4%, respectively.
REFERENCES
[1] Khronos Group, OpenGL ES 2.0, http://www.khronos.org
[2] C.-H. Yu, et al., “A 120Mvertices/s Multi-threaded VLIW Vertex
Processor for Mobile Multimedia Applications,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2006.
[3] B.-G. Nam, et al., “A Low-Power Unified Arithmetic Unit for
Programmable Handheld 3-D Graphics Systems,” in Proc. IEEE CICC,
Sept. 2006.
[4] F. Lai, et al., “A Hybrid Number System Processor with Geometric and
Complex Arithmetic Capabilities,” IEEE Trans. on Computers, vol. 40,
no. 8, Aug. 1991.
[5] J.-H. Sohn, et al., “A Fixed-point Multimedia Co-processor with
50Mvertices/s Programmable SIMD Vertex Shader for Mobile
Applications,” in Proc. ESSCIRC, Sept. 2005.
[6] B.-G. Nam, et al., “A 52.4mW 3D Graphics Processor with
141Mvertices/s Vertex Shader and 3 Power Domains of Dynamic
Voltage and Frequency Scaling,” in IEEE ISSCC Dig. Tech. Papers, Feb.
2007.
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., “An Embedded Processor Core for Consumer
Appliances with 2.8GFLOPS and 36Mpolygons/s,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2004.
[9] D. Kim, et al., “An SoC with 1.3 Gtexels/s 3-D Graphics Full Pipeline for
Consumer Applications,” IEEE JSSC, vol. 41, no. 1, pp. 71-84, Jan. 2006.