A Low-Power Vector Processor Using Logarithmic Arithmetic for Handheld 3D Graphics Systems

Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST)
373-1, Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea

Abstract- A low-power, high-performance 4-way 32-bit vector processor is developed for handheld 3D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By utilizing logarithmic arithmetic, the unit achieves single-cycle throughput for all of these operations except the matrix-vector multiplication, which has 2-cycle throughput. The processor, featuring this function unit, a cascaded integer-float datapath, datapath reconfiguration, operand forwarding in the logarithmic domain, and a vertex cache, takes 9.7mm2 in 0.18µm CMOS technology and achieves 141Mvertices/s for geometry transformation and 12.1Mvertices/s for OpenGL transformation and lighting at 200MHz with 86.8mW power consumption.

I. INTRODUCTION

Handheld graphics processing units (GPUs) incorporate vector processors, known as shaders, in their 3D graphics pipeline stages to provide more realistic images [1]. In [2], a vertex shader with 16-way floating-point (FLP) multipliers was proposed for fast geometry transformation, but it consumed a large silicon area and power. For a power- and area-efficient shader design, a multifunction unit was proposed in [3]. However, it was a fixed-point unit and did not handle the matrix-vector multiplication frequently used for 3D geometry transformations.

In this paper, a 4-way 32-bit FLP vector processor is proposed for the shaders. It adopts a unified matrix, vector, and elementary function unit, which unifies all of these operations in a single 4-way arithmetic unit. The unit operates on FLP data since the newly defined graphics API requires more than 24-bit FLP precision [1].
Although it operates on FLP data, it uses logarithmic arithmetic internally to reduce arithmetic complexity. Its instruction set includes matrix-vector multiplication (MAT); vector operations (VEC) such as vector multiplication (MUL), division (DIV), divide-by-square-root (DSQ), multiply-add (MAD), and dot product (DOT); and elementary functions (ELM) including trigonometric functions (TRGs), power (POW), and logarithm (LOG). It achieves single-cycle throughput with a maximum 5-cycle latency for all of these operations except the MAT, which has 2-cycle throughput and 6-cycle latency. The processor has a cascaded structure of integer and FLP datapaths for efficient indexing of FLP operands. Reconfiguration of the FLP multifunction unit is supported to extend the instruction set according to the user's requirements. Its pipeline exploits operand forwarding in the logarithmic domain to improve pipeline throughput and computation accuracy. A vertex cache is adopted to reuse previously processed results and improve the throughput of the processor.

II. ARITHMETIC UNIT

A. Number System

The proposed arithmetic unit is based on the hybrid approach of FLP and the logarithmic number system (LNS) introduced in [4], where operations are reduced to simpler ones in the LNS, while addition and subtraction are performed in FLP since LNS addition and subtraction require nonlinear term evaluations. Logarithmic and antilogarithmic converters between FLP and LNS are proposed for this hybrid number system (HNS).

Fig. 1. Proposed number converters: (a) logarithmic converter, (b) antilogarithmic converter

1) Logarithmic Converter: For a 32-bit FLP input x = 2^e * (1+m), its logarithmic number is represented as log2(x) = e + log2(1+m).
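The decomposition above can be sketched in Python. This is a behavioral model only (for positive inputs); the hardware converter approximates the log2(1+m) term with a lookup table rather than calling a library logarithm:

```python
import math

def to_log_domain(x: float) -> tuple[int, float]:
    """Split a positive float x = 2**e * (1+m), 0 <= m < 1,
    and return (e, log2(1+m)).

    Behavioral sketch of the decomposition only; the hardware
    LOGC approximates log2(1+m) with a piecewise-linear LUT.
    """
    frac, exp = math.frexp(x)          # x = frac * 2**exp, 0.5 <= frac < 1
    e, m = exp - 1, 2.0 * frac - 1.0   # renormalize so 1 <= 1+m < 2
    return e, math.log2(1.0 + m)

e, lf = to_log_domain(12.5)
# integer part + fractional part reassemble the full logarithm
assert abs((e + lf) - math.log2(12.5)) < 1e-12
```

The integer part e comes for free from the FLP exponent; only the fractional part needs approximation hardware.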
The exponent e is the integer part of the logarithmic number and log2(1+m) is the fractional part. The log2(1+m) term is approximated by piecewise linear expressions of the form log2(1+m) ≈ a*m + b, where a and b are the approximation coefficients defined for each approximation region. The logarithmic converter divides (1+m) into finer subdivisions near 1, since the error increases as the input value approaches 1. It achieves a maximum 0.41% conversion error with 15 approximation regions. Fig. 1(a) shows the proposed logarithmic converter.

2) Antilogarithmic Converter: For a logarithmic number X = e + f, its FLP number can be represented by 2^X = 2^(e+f) = 2^e * 2^f. The integer part e directly becomes the exponent of the FLP number, and the nonlinear term 2^f is approximated by piecewise linear expressions of the form 2^f ≈ a*f + b, where a and b are the approximation coefficients defined for each approximation region. The antilogarithmic converter divides f evenly into the approximation regions since the antilogarithmic conversion error is spread evenly over the entire input region. It achieves a maximum 0.08% conversion error with 8 approximation regions. Fig. 1(b) shows the antilogarithmic converter.

1-4244-1125-4/07/$25.00 ©2007 IEEE

B. Organization

The arithmetic unit is organized with 4 channels and 5 pipeline stages as shown in Fig. 2. The 4 input 32-bit FLP operands are converted into logarithmic numbers with an 8-bit integer, 24-bit fraction, 1-bit sign, and 1-bit zero flag through the 4 logarithmic converters (LOGCs) in E1. E2 includes the programmable multiplier (PMUL) shown in Fig. 3, which can be used as a Booth multiplier for ELM, as 4 LOGCs for VEC, or as 4 antilogarithmic converters (ALOGCs) for MAT by adding a 15-entry 64B LOG lookup table and an 8-entry 56B ALOG lookup table to the Booth multiplier and sharing the CSA tree and a CPA. In this way, the number of LOGCs in E1 is reduced to 4, where 8 were required in [3].
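Returning to the converters of Section II-A, the piecewise-linear scheme can be sketched as follows. This minimal model uses simple chord (secant) fits over 8 uniform regions, in the style of the antilogarithmic converter; the coefficients are computed on the fly and are not the values stored in the paper's LUTs:

```python
import math

def chord_coeffs(f, lo, hi):
    # Secant-line fit a*t + b over [lo, hi] -- illustrative only,
    # not the coefficient values held in the hardware LUTs.
    a = (f(hi) - f(lo)) / (hi - lo)
    return a, f(lo) - a * lo

# Antilogarithmic-style converter: approximate 2**f over [0, 1)
# with 8 uniform approximation regions, as in the paper.
REGIONS = [chord_coeffs(lambda t: 2.0 ** t, i / 8, (i + 1) / 8)
           for i in range(8)]

def alog_approx(f: float) -> float:
    a, b = REGIONS[min(int(f * 8), 7)]
    return a * f + b

# Sweep the input range and measure the worst relative error.
worst = max(abs(alog_approx(i / 4096) - 2.0 ** (i / 4096)) / 2.0 ** (i / 4096)
            for i in range(4096))
assert worst < 0.005   # sub-percent error even with naive chord fits
```

The logarithmic converter follows the same pattern but, as described above, concentrates extra regions near m = 0, where the piecewise-linear error of log2(1+m) is largest.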
In E3, 4 adders in the logarithmic domain are provided for the VECs, and the resulting values are converted into FLP numbers through the 4 ALOGCs. The programmable adder (PADD) in E4 can be programmed into a 5-input FLP adder tree or 4-way 2-input SIMD FLP adders for the target operations, as proposed in [3]. E5 provides a SIMD FLP accumulator for the final accumulation required for the MAT; it is also used as rounding logic for the other operations.

Fig. 2. Proposed arithmetic unit

1) Matrix-Vector Multiplication: The geometry transformation in 3D graphics can be computed by the multiplication of a 4x4 matrix with a 4-element vector, which requires 16 multiplications and 12 additions. This can be converted into the HNS as in Fig. 4, requiring 20 LOGCs, 16 adders, 16 ALOGCs, and 12 FLP adders. Since the coefficients of a geometry transformation matrix are fixed while a 3D object is processed, they can be pre-converted into the logarithmic domain and used as constants during the processing. Thus, the MAT requires only 4 LOGCs for vector element conversion, 16 adders in the logarithmic domain, 16 ALOGCs, and 12 FLP adders. This can be implemented in 2 phases on this 4-way arithmetic unit as illustrated in Fig. 4. In this scheme, 8 adders in the logarithmic domain and 8 ALOGCs are required per phase; the 8 ALOGCs of the first phase are obtained by programming the PMUL in E2 into 4 ALOGCs and combining them with the 4 ALOGCs in E3. The CPAs in E1 and E3 serve as the 8 adders in the logarithmic domain.
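The log-domain MAT scheme can be sketched behaviorally. The matrix and vertex values below are hypothetical, and the 1-bit sign and zero flags that handle non-positive operands in the hardware are omitted (all values here are positive):

```python
import math

# Hypothetical 4x4 transformation matrix and input vertex.
C = [[2.0, 0.5, 1.0, 4.0],
     [0.25, 2.0, 8.0, 1.0],
     [1.0, 1.0, 2.0, 0.5],
     [4.0, 0.5, 0.25, 2.0]]
x = [1.5, 2.0, 0.5, 3.0]

# Matrix coefficients are fixed per 3D object, so they are
# pre-converted to the logarithmic domain once, not per vertex.
logC = [[math.log2(c) for c in row] for row in C]
logx = [math.log2(v) for v in x]   # the 4 LOGCs, once per vertex

y = []
for i in range(4):
    # Phase 1 covers columns 0-1, phase 2 covers columns 2-3:
    # each product is a log-domain add followed by an ALOGC.
    p1 = sum(2.0 ** (logC[i][j] + logx[j]) for j in (0, 1))
    p2 = sum(2.0 ** (logC[i][j] + logx[j]) for j in (2, 3))
    y.append(p1 + p2)              # FLP accumulation (E5)

expected = [sum(C[i][j] * x[j] for j in range(4)) for i in range(4)]
assert all(abs(a - b) < 1e-9 for a, b in zip(y, expected))
```

Each of the 16 multiplications has become one addition in the logarithmic domain, which is the source of the area and power savings.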
The 4 multiplication results from the ALOGCs in E2 and the other 4 from E3 are added in E4 by programming the PADD into a 4-way SIMD FLP adder to obtain the first-phase result. The same process is repeated, and the accumulation with the first-phase result in E5 completes the MAT. Thus, the MAT is implemented with 2-cycle throughput on this 4-way arithmetic unit, whereas it required 4-cycle throughput in the conventional approach [3], [5].

Fig. 3. Programmable multiplier (PMUL)

For y = Cx with a 4x4 matrix C = (c_ij) and vector x = (x_j), each element is computed in the HNS as

  y_i = sum_{j=0..3} c_ij * x_j = sum_{j=0..3} 2^(log2(c_ij) + log2(x_j)),  i = 0, ..., 3.

Fig. 4. Two-phase implementation of MAT

2) Vector and Elementary Functions: The VECs and ELMs, including the MUL, DIV, DSQ, MAD, DOT, POW, LOG, and TRGs, are implemented based on the scheme in [3]. Since the vector operations require 2 LOGCs for the 2 operands per channel, the PMUL is programmed into 4 LOGCs to make 8 LOGCs for the 4 channels together with the 4 LOGCs in E1. Since the power function is converted into a multiplication in the logarithmic domain, it requires a 32b x 24b multiplier, implemented by programming the PMUL into a single 32b x 24b Booth multiplier. The TRGs are unified with the others using the Taylor series as proposed in [3]. This power series requires a 4-way 32b x 6b multiplier in the logarithmic domain and a final summation of the terms. This is implemented by programming the PMUL into a 4-way 32b x 6b Booth multiplier and the PADD into a 5-input summation tree.

III. PROCESSOR ARCHITECTURE

A. Instruction Set Architecture

The processor has two instruction formats, 32 bits and 64 bits wide.
The 32-bit instructions are used for control and integer (INT) operations, while the 64-bit instructions are used for FLP operations. This separation of formats, rather than a single 64-bit format, reduces instruction memory (IMEM) size and power dissipation. In a loop, the register indices of the FLP operands may be calculated dynamically as a function of the loop counter i as follows:

  OPR Dst[i+kd], SrcA[i+ka], SrcB[i+kb], SrcC[i+kc]

Thus, the INT instructions calculating the indices can be embedded in the FLP instruction as its operand fields for efficient index calculation. In this case, the INT computation is followed by the FLP computation within a single 64-bit instruction, as shown in Fig. 5.

Fig. 5. 64-bit FLP instruction format

B. Micro-architecture

Fig. 6 shows the micro-architecture of the proposed processor. It has 4-way 32-bit FLP vector register files including a 512-byte 32-entry vertex input register (VIR), a general-purpose register file (GPR), a 4KB 256-entry constant memory (CMEM), and a 256-byte 16-entry vertex output register (VOR). The FLP operands are fetched from the GPR, VIR, or CMEM, and the result is written back to the GPR or VOR, indexing the target entry in the destination register file. The FLP operands can be swizzled, negated, and converted to absolute values by the source modifiers. The arithmetic unit proposed in Section II is used as the FLP multifunction unit in this processor. The vertex cache is composed of 16 VORs.

1) Cascaded INT-FLP Datapaths: This processor has a cascaded architecture of 4-way 8-bit INT and 4-way 32-bit FLP datapaths to implement the embedded index calculation of the FLP operands without additional cycles. For flexible index calculation, the processor includes a 4-way 8-bit SIMD integer ALU and a 64-byte 16-entry integer register file (IGPR). The four 8-bit results from this unit index the 3 source operands and 1 destination of a FLP operation.
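A minimal model of the embedded index calculation is sketched below. The 8-bit wrap-around is an assumption based on the 8-bit INT datapath, and `operand_indices` is an illustrative name, not an instruction from the paper:

```python
def operand_indices(i, kd, ka, kb, kc):
    """Indices for `OPR Dst[i+kd], SrcA[i+ka], SrcB[i+kb], SrcC[i+kc]`.

    The 4-way 8-bit SIMD INT ALU evaluates all four sums in one pass,
    so the FLP operation that follows needs no extra indexing cycle.
    Wrapping to 8 bits is an assumption from the 8-bit datapath width.
    """
    return tuple((i + k) & 0xFF for k in (kd, ka, kb, kc))

# Loop iteration i = 5 with offsets kd=1, ka=0, kb=2, kc=3.
assert operand_indices(5, 1, 0, 2, 3) == (6, 5, 7, 8)
```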
When a multiplication is required for the index calculation, the PMUL in the FLP arithmetic unit is programmed into a 32b x 24b integer multiply-add (IMAD) unit, since conversion error is not allowed in index calculation.

2) Datapath Reconfiguration: The FLP unit of Section II has several MUXes to unify the various operations and exposes all of its control points to the programmer, so that arbitrary operations can be composed by programming the 91-bit control signals. For example, the various TRGs [3] supported in this processor can be programmed by a single configuration instruction (CFG) with some configuration data, rather than including all of the TRGs in the limited instruction space. The programmed configuration data are stored in the 64-byte 4-entry configuration register file (CFR) and accessed by the CFG. Thus, the configuration can be changed every cycle.

Fig. 6. Micro-architecture of proposed processor

3) Logarithmic-Domain Forwarding: Operand forwarding improves the throughput of the processor pipeline. In this processor, operand forwarding is also supported in the logarithmic domain. As shown in Fig. 7, for consecutive FLP operations that do not require the final FLP adders, the antilogarithmic and logarithmic conversions cancel each other, and the intermediate logarithmic value of the previous FLP operation is forwarded directly into the logarithmic domain of the next FLP operation, bypassing the antilogarithmic and logarithmic converters of the two operations.
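The cancellation can be sketched numerically. Here a quantized `log2` stands in for the lossy converter (an assumption for illustration; the real error source is the piecewise-linear approximation, not quantization):

```python
import math

def lossy_log2(x: float, frac_bits: int = 24) -> float:
    # Behavioral stand-in for the LOGC: quantizing the fraction
    # acts as a proxy for the converter's approximation error.
    return round(math.log2(x) * 2 ** frac_bits) / 2 ** frac_bits

a, b, c = 3.7, 1.9, 0.8

# Conventional forwarding: leave and re-enter the log domain between ops.
t = 2.0 ** (lossy_log2(a) + lossy_log2(b))        # op1 ends with an ALOGC
conv = 2.0 ** (lossy_log2(t) + lossy_log2(c))     # op2 starts with a LOGC

# Logarithmic-domain forwarding: the ALOGC/LOGC pair cancels, so the
# intermediate value stays in the log domain.
fwd = 2.0 ** (lossy_log2(a) + lossy_log2(b) + lossy_log2(c))

exact = a * b * c
assert abs(fwd - exact) / exact < 1e-6
assert abs(conv - exact) / exact < 1e-6
# The forwarded path performs one fewer lossy conversion round-trip,
# and finishes a pipeline round earlier.
```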
This reduces the pipeline latency and the computation error by bypassing the repeated antilogarithmic and logarithmic conversions, which are the sources of error.

Fig. 7. Logarithmic-domain forwarding

4) Vertex Cache: A transformation and lighting (TnL) vertex cache is provided to reuse previously processed vertices without re-executing the TnL routine. The 4KB SRAM holds 16 result vertices and shows a 58% hit rate. This leads to a single-cycle TnL for the vertices in the vertex cache and a peak geometry transformation (TFM) rate of 141Mvertices/s together with the 2-cycle MAT.

IV. IMPLEMENTATION RESULTS

A. Chip Implementation

The proposed processor is integrated into a 3D graphics chip as a vertex shader and fabricated in 0.18µm 6-metal CMOS technology [6]. The chip micrograph and a summary of chip characteristics are shown in Fig. 8. The core occupies 9.7mm2 and operates at 200MHz, consuming 86.8mW at 1.8V. Fig. 9 shows the shmoo plot. The clock to the pipeline registers is fully gated to disable unnecessary switching under the control of each instruction. All FLP instructions are processed with single-cycle throughput, except the MAT with 2-cycle throughput. Table I shows the latencies and throughputs of selected FLP instructions.

Fig. 8. Chip micrograph

Fig. 9. Shmoo plot (operating frequency vs. supply voltage)

TABLE I
THE LATENCY/THROUGHPUT OF SELECTED INSTRUCTIONS

              ADD  DIV  DSQ  MAD  DOT  MAT  POW  SIN
  Latency      1    3    3    5    5    6    3    5
  Throughput   1    1    1    1    1    2    1    1

B. Comparison

The performance is compared on the full OpenGL TnL routine [7] with model-view, normal, and perspective transformations; normalizations of the light, view, normal, and Blinn half vectors; and intensity calculations of diffuse and specular lighting for a single light source.
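The quoted vertex rates can be reproduced from the 58% hit rate under a simple model (assumption: a cache hit takes one cycle, while a miss pays the full cost, i.e. the 2-cycle MAT for TFM or the 38-cycle TnL routine):

```python
# Effective throughput of the vertex-cached pipeline at 200MHz.
# Assumption: hits complete in 1 cycle; misses pay the full cost.
FREQ_MHZ = 200.0
HIT_RATE = 0.58

def mvertices_per_s(miss_cycles: float) -> float:
    avg_cycles = HIT_RATE * 1.0 + (1.0 - HIT_RATE) * miss_cycles
    return FREQ_MHZ / avg_cycles

tfm = mvertices_per_s(2.0)    # miss cost = 2-cycle MAT
tnl = mvertices_per_s(38.0)   # miss cost = 38-cycle TnL routine

assert abs(tfm - 141.0) < 1.0   # ~141 Mvertices/s peak TFM
assert abs(tnl - 12.1) < 0.1    # ~12.1 Mvertices/s TnL
```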
The routine includes 3 MATs, 4 DSQs, 2 DIVs, 6 DOTs, 2 MADs, 1 POW, and 2 ADDs. After rescheduling the code to avoid dependencies, the routine executes in 38 cycles on the proposed processor. Table II shows the comparison results; the peak TFM performance is also compared. Our work shows 17.5% and 22.2% performance improvements for TFM and TnL, respectively, while reducing power and area by 44.7% and 39.4%, respectively, compared with the latest work [2].

TABLE II
THE COMPARISON RESULTS

  Ref        TFM (Mvertices/s)  TnL (Mvertices/s)  Power (mW)  Area (mm2)  Freq. (MHz)  Process (µm)
  [8]        36                 3.85               250         N.A.        400          0.13
  [5]        50                 3.6                75.4        10.2        200          0.18
  [9]        33                 7.55               N.A.        N.A.        166          0.13
  [2]        120                9.9                157         16          100          0.18
  This work  141                12.1               86.8        9.7         200          0.18

V. CONCLUSION

A high-performance, power- and area-efficient vector processor is proposed for 3D graphics shaders. It adopts a 4-way 32-bit FLP multifunction unit. Using logarithmic arithmetic, the unit unifies the vector, matrix, and elementary functions into a single arithmetic unit and achieves single-cycle throughput for all operations except the MAT, which has 2-cycle throughput. With the help of this multifunction unit together with the cascaded integer-float datapath, datapath reconfiguration, logarithmic-domain forwarding, and the 58%-hit-rate vertex cache, the processor achieves 141Mvertices/s for TFM and 12.1Mvertices/s for TnL at 200MHz. Compared with previous work, it shows 17.5% and 22.2% performance improvements for TFM and TnL, respectively, while reducing power and area by 44.7% and 39.4%, respectively.

REFERENCES

[1] Khronos Group, OpenGL ES 2.0, http://www.khronos.org
[2] C.-H. Yu, et al., "A 120Mvertices/s Multi-threaded VLIW Vertex Processor for Mobile Multimedia Applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2006.
[3] B.-G. Nam, et al., "A Low-Power Unified Arithmetic Unit for Programmable Handheld 3-D Graphics Systems," in Proc. IEEE CICC, Sept. 2006.
[4] F. Lai, et al., "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. on Computers, vol. 40, no. 8, Aug. 1991.
[5] J.-H. Sohn, et al., "A Fixed-point Multimedia Co-processor with 50Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications," in Proc. ESSCIRC, Sept. 2005.
[6] B.-G. Nam, et al., "A 52.4mW 3D Graphics Processor with 141Mvertices/s Vertex Shader and 3 Power Domains of Dynamic Voltage and Frequency Scaling," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007.
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., "An Embedded Processor Core for Consumer Appliances with 2.8GFLOPS and 36Mpolygons/s," in IEEE ISSCC Dig. Tech. Papers, Feb. 2004.
[9] D. Kim, et al., "An SoC with 1.3 Gtexels/s 3-D Graphics Full Pipeline for Consumer Applications," IEEE JSSC, vol. 41, no. 1, pp. 71-84, Jan. 2006.
