An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness

Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology, shong9@gatech.edu
Hyesoon Kim, School of Computer Science, Georgia Institute of Technology, hyesoon@cc.gatech.edu

ABSTRACT

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications.

To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.

Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques; C.5.3 [Computer System Implementation]: Microcomputers

General Terms
Measurement, Performance

Keywords
Analytical model, CUDA, GPU architecture, Memory level parallelism, Warp level parallelism, Performance estimation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA'09, June 20-24, 2009, Austin, Texas, USA. Copyright 2009 ACM 978-1-60558-526-0/09/06 ...$5.00.

1. INTRODUCTION

The increasing computing power of GPUs gives them considerably higher peak computing power than CPUs. For example, NVIDIA's GTX280 GPUs [3] provide 933 Gflop/s with 240 cores, while Intel's Core2Quad processors [2] deliver only 100 Gflop/s. Intel's next generation of graphics processors will support more than 900 Gflop/s [26]. AMD/ATI's latest GPU (HD4870) provides 1.2 Tflop/s [1]. However, even though hardware is providing high performance computing, writing parallel programs to take full advantage of this high performance computing power is still a big challenge.

Recently, there have been new programming languages that aim to reduce programmers' burden in writing parallel applications for the GPUs, such as Brook+ [5], CUDA [22], and OpenCL [16].
However, even with these newly developed programming languages, programmers still need to spend enormous time and effort to optimize their applications to achieve better performance [24]. Although the GPGPU community [11] provides general guidelines for optimizing applications using CUDA, clearly understanding the various features of the underlying architecture and the associated performance bottlenecks in their applications still remains homework for programmers. Therefore, programmers might need to vary all the combinations to find the best performing configurations [24].

To provide insight into performance bottlenecks in massively parallel architectures, especially GPU architectures, we propose a simple analytical model. The model can be used statically without executing an application. The basic intuition of our analytical model is that estimating the cost of memory operations is the key component of estimating the performance of parallel GPU applications. The execution time of an application is dominated by the latency of memory instructions, but the latency of each memory operation can be hidden by executing multiple memory requests concurrently. By using the number of concurrently running threads and the memory bandwidth consumption, we estimate how many memory requests can be executed concurrently, which we call memory warp¹ parallelism (MWP). We also introduce computation warp parallelism (CWP). CWP represents how much computation can be done by other warps while one warp is waiting for memory values. CWP is similar to a metric, arithmetic intensity² [23], in the GPGPU community. Using both MWP and CWP, we estimate effective costs of memory requests, thereby estimating the overall execution time of a program.

We evaluate our analytical model based on the CUDA [20, 22] programming language, which is C with extensions for parallel threads. We compare the results of our analytical model with the actual execution time on several GPUs. Our results show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on the Merge benchmarks [17]³ is 13.3%.

¹ A warp is a batch of threads that are internally executed together by the hardware. Section 2 describes a warp.
² Arithmetic intensity is defined as math operations per memory operation.
³ The Merge benchmarks consist of several media processing applications.

The contributions of our work are as follows:

1. To the best of our knowledge, we propose the first analytical model for the GPU architecture. This can be easily extended to other multithreaded architectures as well.
2. We propose two new metrics, MWP and CWP, to represent the degree of warp level parallelism, which provide key insights for identifying performance bottlenecks.

2. BACKGROUND AND MOTIVATION

We provide a brief background on the GPU architecture and programming model that we modeled. Our analytical model is based on the CUDA programming model and the NVIDIA Tesla architecture [3, 8, 20] used in the GeForce 8-series GPUs.

2.1 Background on the CUDA Programming Model

The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model. The GPU is treated as a coprocessor that executes data-parallel kernel functions. CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three-level hierarchy. A grid is a set of thread blocks that execute a kernel function. Each grid consists of blocks of threads. Each block is composed of hundreds of threads. Threads within one block can share data using shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture.

The programmer specifies the number of threads per block and the number of blocks per grid. A thread in the CUDA programming language is much lighter weight than a thread in traditional operating systems. A thread in CUDA typically processes one data element at a time. The CUDA programming model has two shared read-write memory spaces, the shared memory space and the global memory space. The shared memory is local to a block and the global memory space is accessible by all blocks. CUDA also provides two read-only memory spaces, the constant space and the texture space, which reside in external DRAM and are accessed via read-only caches.
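To make these abstractions concrete, the following is a minimal illustrative sketch (a hypothetical example, not code from the paper): each thread of the kernel handles one data element, threads within a block stage data in shared memory and synchronize at a barrier with __syncthreads(), and the host chooses the number of threads per block and blocks per grid at launch time.

    // Minimal sketch of the CUDA abstractions described above
    // (hypothetical example, not code from the paper).
    #include <cuda_runtime.h>

    __global__ void scale_kernel(const float *in, float *out, float factor, int n)
    {
        __shared__ float tile[256];                     // shared memory: local to this block

        int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid -> block -> thread hierarchy

        if (i < n)
            tile[threadIdx.x] = in[i];                  // each thread processes one element

        __syncthreads();                                // barrier for all threads in the block

        if (i < n)
            out[i] = tile[threadIdx.x] * factor;
    }

    void launch_scale(const float *d_in, float *d_out, float factor, int n)
    {
        int threads_per_block = 256;                    // specified by the programmer
        int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
        scale_kernel<<<blocks_per_grid, threads_per_block>>>(d_in, d_out, factor, n);
    }

Here the shared-memory staging exists only to exhibit the block-local shared memory and the barrier; the kernel would work equally well reading from global memory directly.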
2.2 Background on the GPU Architecture

Figure 1 shows an overview of the GPU architecture. The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming processor (SP) cores, two special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a 16KB read/write shared memory [8].

The SM executes a batch of 32 threads together called a warp. Executing a warp instruction applies the instruction to 32 threads, similar to executing a SIMD instruction like an SSE instruction [14] in x86. However, unlike SIMD instructions, the concept of a warp is not exposed to the programmers; rather, programmers write a program for one thread, and then specify the number of parallel threads in a block and the number of blocks in a kernel grid. The Tesla architecture forms a warp using a batch of 32 threads [13, 9], and in the rest of the paper we also use a warp as a batch of 32 threads.

Figure 1: An overview of the GPU architecture (streaming multiprocessors with their SIMD execution units, shared memory, and I-cache/decoder, connected through an interconnection network to the global/device memory).

All the threads in one block are executed on one SM together. One SM can also have multiple concurrently running blocks. The number of blocks that are running on one SM is determined by the resource requirements of each block, such as the number of registers and shared memory usage. The blocks that are running on one SM at a given time are called active blocks in this paper. Since one block typically has several warps (the number of warps is the same as the number of threads in a block divided by 32), the total number of active warps per SM is equal to the number of warps per block times the number of active blocks.

The shared memory is implemented within each SM as an SRAM, and the global memory is part of the off-chip DRAM. The shared memory has very low access latency (almost the same as that of a register) and high bandwidth. However, since a warp of 32 threads accesses the shared memory together, when there is a bank conflict within a warp, accessing the shared memory takes multiple cycles.
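As a purely illustrative sketch of the bank-conflict behavior (hypothetical code, not from the paper, assuming the 16-bank, 32-bit-word shared memory organization of the GeForce 8-series generation): when consecutive threads access consecutive words, each access falls in a distinct bank, whereas a stride of two makes pairs of threads collide on the same bank and serializes the access.

    // Hypothetical sketch contrasting shared-memory access patterns
    // (assumes a 16-bank, 32-bit-word shared memory, as in the GeForce 8-series).
    __global__ void bank_conflict_demo(float *out)
    {
        __shared__ float s[512];

        s[threadIdx.x]       = (float)threadIdx.x;   // fill the words read below
        s[threadIdx.x + 256] = (float)threadIdx.x;   // (launched with 256 threads/block)
        __syncthreads();

        // Stride-1: thread k reads word k, so consecutive threads hit
        // consecutive banks -> no bank conflict.
        float a = s[threadIdx.x];

        // Stride-2: threads k and k+8 both map to bank (2*k mod 16),
        // a two-way bank conflict -> the access takes twice as many cycles.
        float b = s[threadIdx.x * 2];

        out[threadIdx.x] = a + b;
    }

The stride-2 pattern is only one example; any mapping that sends multiple threads of a warp to the same bank has the same serializing effect.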
2.3 Coalesced and Uncoalesced Memory Accesses

The SM processor executes one warp at a time, and schedules warps in a time-sharing fashion. The processor has enough functional units and register read/write ports to execute 32 threads (i.e., one warp) together. Since an SM has only 8 functional units, executing 32 threads takes 4 SM processor cycles for computation instructions.⁴

When the SM processor executes a memory instruction, it generates memory requests and switches to another warp until all the memory values in the warp are ready. Ideally, all the memory accesses within a warp can be combined into one memory transaction. Unfortunately, that depends on the memory access pattern within a warp. If the memory addresses are sequential, all of the memory requests within a warp can be coalesced into a single memory transaction. Otherwise, each memory address will generate a different transaction. Figure 2 illustrates the two cases. The CUDA manual [22] provides detailed algorithms to identify types of coalesced/uncoalesced memory accesses. If memory requests in a warp are uncoalesced, the warp cannot be executed until all memory transactions from the same warp are serviced, which takes significantly longer than waiting for only one memory request (the coalesced case).

⁴ In this paper, a computation instruction means a non-memory instruction.

Figure 2: Memory requests from a single warp. (a) Coalesced memory access: sequential addresses (Addr 1 through Addr 32) are combined into a single memory transaction. (b) Uncoalesced memory access: each address generates a separate memory transaction.
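To make the two patterns in Figure 2 concrete, the hypothetical kernels below (a sketch, not code from the paper) contrast a sequential access that the hardware can coalesce into a single transaction per warp with a widely strided access in which each thread's address ends up in its own transaction.

    // Hypothetical sketch of the two access patterns of Figure 2
    // (illustration only; the exact coalescing rules are given in the CUDA manual [22]).
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];              // sequential addresses within a warp -> coalesced
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];     // widely scattered addresses -> one transaction
                                         // per thread (uncoalesced)
    }

Both kernels perform the same number of loads per thread; only the address pattern within each warp differs.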
2.4 Motivating Example

To motivate the importance of a static performance analysis on the GPU architecture, we show an example of performance differences from three different versions of the same algorithm in Figure 3. The SVM benchmark is a kernel extracted from a face classification algorithm [28]. The performance of the applications is measured on an NVIDIA QuadroFX5600 [4]. There are three different optimized versions of the same SVM algorithm: Naive, Constant, and Constant+Optimized. Naive uses only the global memory, Constant uses the cached read-only constant memory⁵, and Constant+Optimized also optimizes memory accesses⁶ on top of using the constant memory. Figure 3 shows the execution time when the number of threads per block is varied. In this example, the number of blocks is fixed, so the number of threads per block determines the total number of threads in the program. The performance improvements of Constant+Optimized and of Constant over the Naive implementation are 24.36x and 1.79x respectively. Even though the performance of each version might be affected by the number of threads, once the number of threads exceeds 64, the performance does not vary significantly.

⁵ The benefits of using the constant memory are (1) it has an on-chip cache per SM and (2) using the constant memory can reduce register usage, which might increase the number of running blocks in one SM.
⁶ The programmer optimized the code to have coalesced memory accesses instead of uncoalesced memory accesses.

Figure 3: Optimization impacts on SVM (execution time in ms versus threads per block for the Naive, Constant, and Constant+Optimized versions).

Figure 4 shows the SM processor occupancy [22] for the three cases. The SM processor occupancy indicates the resource utilization and has been widely used to optimize GPU computing applications. It is calculated based on the resource requirements for a given program. Typically, high occupancy (the maximum value is 1) is better for performance, since many actively running threads would more likely hide the DRAM memory access latency. However, SM processor occupancy does not sufficiently estimate the performance improvement, as shown in Figure 4. First, when the number of threads per block is less than 64, all three cases show the same occupancy values even though the performances of the three cases are different. Second, even though SM processor occupancy is improved, for some cases there is no performance improvement. For example, the performance of Constant is not improved at all even though its SM processor occupancy is increased from 0.35 to 1. Hence, we need other metrics to differentiate the three cases and to understand what the critical component of performance is.

Figure 4: Occupancy values of SVM (occupancy versus threads per block for the three versions).

3. ANALYTICAL MODEL

3.1 Introduction to MWP and CWP

The GPU architecture is a multithreaded architecture. Each SM can execute multiple warps in a time-sharing fashion while one or more warps are waiting for memory values. As a result, the execution cost of warps that are executed concurrently can be hidden. The key component of our analytical model is finding out how many memory requests can be serviced and how many warps can be executed together while one warp is waiting for memory values.

To represent the degree of warp parallelism, we introduce two metrics, MWP (Memory Warp Parallelism) and CWP (Computation Warp Parallelism). MWP represents the maximum number of warps per SM that can access the memory simultaneously during the time period from right after the SM processor executes a memory instruction from one warp (therefore, memory requests are just sent to the memory system) until all the memory requests from the same warp are serviced (therefore, the processor can execute the next instruction from that warp). The warp that is waiting for memory values is called a memory warp in this paper. The time period from right after one warp sends memory requests until all the memory requests from the same warp are serviced is called one memory warp waiting period. CWP represents the number of warps that the SM processor can execute during one memory warp waiting period, plus one. The one is added to include the warp itself that is waiting for memory values. (This means that CWP is always greater than or equal to 1.)

MWP is related to how much memory parallelism is available in the system. MWP is determined by the memory bandwidth, the memory bank parallelism, and the number of running warps per SM. MWP plays a very important role in our analytical model. When MWP is higher than 1, the cost of memory access cycles from (MWP-1) warps is entirely hidden, since they are all accessing the memory system together. The detailed algorithm for calculating MWP will be described in Section 3.3.1.
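The precise MWP calculation appears in Section 3.3.1 (beyond this excerpt). Purely as a rough, hypothetical illustration of the intuition above — MWP is bounded both by how many warps' requests the memory system (bandwidth and bank parallelism) can keep in flight during one memory warp waiting period and by the number of running warps per SM — one could sketch the estimate as follows; the parameter names and numbers are illustrative assumptions, not the paper's formulation.

    /* Rough sketch of the MWP intuition only; this is NOT the paper's
     * Section 3.3.1 algorithm, and all names and numbers are illustrative. */
    #include <stdio.h>

    static int estimate_mwp(double waiting_period_cycles,  /* one memory warp waiting period   */
                            double request_spacing_cycles, /* bandwidth/bank limit on how soon */
                                                           /* the next warp's requests can go  */
                            int active_warps_per_sm)
    {
        /* Warps whose requests can overlap within one waiting period. */
        int overlap_limited = (int)(waiting_period_cycles / request_spacing_cycles);
        int mwp = overlap_limited < active_warps_per_sm ? overlap_limited
                                                        : active_warps_per_sm;
        return mwp < 1 ? 1 : mwp;   /* at least the waiting warp itself */
    }

    int main(void)
    {
        /* Illustrative numbers, not measurements from the paper. */
        printf("estimated MWP = %d\n", estimate_mwp(420.0, 10.0, 24));
        return 0;
    }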
CWP is related to the program characteristics. It is similar to arithmetic intensity, but unlike arithmetic intensity, a higher CWP means less computation per memory access. CWP also considers timing information, whereas arithmetic intensity does not. CWP is mainly used to decide whether the total execution time is dominated by computation cost or by memory access cost. When CWP is greater than MWP, the execution cost is dominated by memory access cost. However, when MWP is greater than CWP, the execution cost is dominated by computation cost. How to calculate CWP will be described in Section 3.3.2.

3.2 The Cost of Executing Multiple Warps in the GPU Architecture

To explain how executing multiple warps in each SM affects the total execution time, we illustrate several scenarios in Figures 5, 6, 7 and 8. A computation period indicates the period when instructions from one warp are executed on the SM processor. A memory waiting period indicates the period when memory requests are being serviced. The numbers inside the computation period boxes and memory waiting period boxes in Figures 5, 6, 7 and 8 indicate a warp identification number.

3.2.1 CWP is Greater than MWP

Figure 5: Total execution time when CWP is greater than MWP: (a) 8 warps (b) 4 warps.

For Case 1 in Figure 5a, we assume that all the computation periods and memory waiting periods are from different warps. The system can service two memory warps simultaneously. Since one computation period is roughly one third of one memory warp waiting period, the processor can finish 3 warps' computation periods during one memory warp waiting period (i.e., MWP is 2 and CWP is 4 for this case). As a result, the 6 computation periods are completely overlapped with other memory waiting periods. Hence, only 2 computation periods and 4 memory waiting periods contribute to the total execution cycles. For Case 2 in Figure 5b, there are four warps a