On the Efficacy of a Fused CPU+GPU Processor
(or APU) for Parallel Computing
Mayank Daga, Ashwin M. Aji, and Wu-chun Feng
Dept. of Computer Science
Virginia Tech
Blacksburg, USA
{mdaga, aaji, feng}@cs.vt.edu
Abstract—The graphics processing unit (GPU) has made sig-
nificant strides as an accelerator in parallel computing. However,
because the GPU has resided out on PCIe as a discrete device,
the performance of GPU applications can be bottlenecked by
data transfers between the CPU and GPU over PCIe. Emerging
heterogeneous computing architectures that “fuse” the function-
ality of the CPU and GPU, e.g., AMD Fusion and Intel Knights
Ferry, hold the promise of addressing the PCIe bottleneck.
In this paper, we empirically characterize and analyze the
efficacy of AMD Fusion, an architecture that combines general-
purpose x86 cores and programmable accelerator cores on the
same silicon die. We characterize its performance via a set of
micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks
(e.g., reduction), and actual applications (e.g., molecular dynam-
ics). Depending on the benchmark, our results show that Fusion
produces a 1.7 to 6.0-fold improvement in the data-transfer time,
when compared to a discrete GPU. In turn, this improvement in
data-transfer performance can significantly enhance application
performance. For example, running a reduction benchmark on
AMD Fusion with its mere 80 GPU cores improves performance
by 3.5-fold over the discrete AMD Radeon HD 5870 GPU, despite the latter's 1600 more powerful GPU cores.
Keywords-AMD Fusion; graphics processing unit; GPU;
GPGPU; accelerated processing unit; APU; OpenCL; perfor-
mance evaluation; benchmarking; heterogeneous computing;
I. INTRODUCTION
The widespread adoption of compute-capable graphics pro-
cessing units (GPUs) in desktops and workstations has made
them attractive as accelerators for high-performance parallel
computing [1]. Their increased popularity has been due in
part to a unique amalgamation of performance, power, and
energy efficiency. In fact, three out of the top five fastest
supercomputers in the world, according to the Top500, employ
GPUs [2].
A closer look at the Top500 list, however, reveals that the
supercomputers powered by GPUs attain only ∼50% of their
theoretical peak performance, whereas non-GPU-powered su-
percomputers attain ∼78% of the theoretical peak [2]. This
implies that there are certain aspects of GPUs that limit
their performance for Linpack, the benchmark used to rank
supercomputers on the Top500. Since GPUs have traditionally
resided on PCI Express (PCIe), additional overhead costs are
incurred for host-to-GPU data transfers and vice versa. As a
consequence, GPU applications are oftentimes bottlenecked by
the PCIe data transfers. Thus, GPUs are not a panacea [3]–[5].
With the emergence of heterogeneous computing architec-
tures that “fuse” the functionality of the CPU and GPU onto
the same die, e.g., AMD Fusion and Intel Knights Ferry,
there is an expectation that the PCIe bottlenecks would be
addressed. In these architectures, the x86 CPU cores and the
programmable GPU cores share a common path to system
memory. Also present are high-speed block transfer engines,
which assist in data movement between the x86 and GPU
cores. Hence, data transfers never hit the system’s external
bus, thereby mitigating the adverse effects of slow PCIe.
In this paper, we present an empirical characterization and
analysis of the effectiveness of the AMD Fusion architecture.
To the best of our knowledge, this work is the first to do
so. The processor built upon the Fusion architecture is called
an accelerated processing unit or APU. The APU combines
the general-purpose x86 cores of a CPU with programmable
vector-processing engines of a GPU onto a single silicon die.
We then re-visit Amdahl’s Law for today’s era of accelerated
processors. Specifically, we show that the fused CPU+GPU
cores enable better performance than a discrete GPU and even
traditional multi-core CPU processors by reducing the parallel
overhead of PCIe data transfers.
To characterize the performance of the AMD Fusion archi-
tecture, we use four benchmarks from the Scalable HeterOge-
neous Computing (SHOC) benchmark suite [6] as well as the
OpenCL PCIe bandwidth test. Via these benchmarks, we show
that the Fusion architecture can overcome the PCIe bottleneck
associated with a discrete GPU, though not always.
For the first-generation AMD Fusion, the APU, which is a
fused combination of CPU+GPU, delivers better performance
than a discrete CPU+GPU combination when the amount of
data to be transferred between the CPU and GPU exceeds a
minimum threshold and when the amount of computation on
the GPU cores is not high enough for the discrete GPU to
amortize the PCIe data-transfer overhead.
Our empirical results indicate that the APU improves data-
transfer times by 1.7 to 6.0-fold over the discrete CPU+GPU
combination. For one particular benchmark, i.e., reduction,
the total execution time is 3.5-fold better on the Fusion APU
than on the discrete GPU despite the latter having 20 times
more GPU cores and more powerful cores at that. In turn,
the improvement in data-transfer times reduces the parallel
overhead, thus providing more parallelism to the application.
The rest of this paper is organized as follows. Section II
presents an overview of AMD GPUs and the issue of the
PCIe bottleneck in discrete CPU+GPU platforms. We then in-
troduce the AMD Fusion architecture and outline why it holds
the promise of overcoming this bottleneck. In Section III,
we re-visit Amdahl’s Law for accelerator-based processors.
In Section IV, we illustrate and discuss the results of our
experiments. Section V presents related work, followed by
conclusions and future work in Section VI.
II. BACKGROUND
Here we present an overview of the AMD GPU and discuss
the effect of the PCIe bottleneck, which oftentimes proves
to be an obstacle towards achieving better overall application
performance. We then describe the architecture of the first
generation of the accelerated processing unit (APU), i.e., the AMD
Fusion E-Series APU, and show how it holds the promise of
overcoming the PCIe bottleneck.
A. Overview of the AMD GPU
An AMD GPU follows a classic graphics design, which is highly tuned for single-precision floating-point arithmetic and for common image operations on two-dimensional matrix and image data. Fig. 1 provides an architectural overview.
Fig. 1. Overview of an AMD/ATI Stream Processor and Thread Scheduler. (Figure: an ultra-threaded dispatch processor feeds the SIMD engines; each thread processor contains stream cores, a T-Stream core, and a branch execution unit, backed by general-purpose registers, a thread queue, and a rasterizer.)
In this case, the compute unit is known as a SIMD engine
and contains several thread processors, each containing four
stream cores, along with a special-purpose core and a branch
execution unit. The special-purpose (or T-Stream) core is de-
signed to execute certain mathematical functions in hardware,
e.g., transcendentals like sin(), cos(), and tan(). Since there is only one branch execution unit for every five processing cores, any branch in the program incurs some serialization to determine the path each thread should take; divergent branches then execute in lock-step across all the cores in a compute unit. In addition, the processing cores are vector processors, which means that using vector types can produce a material speedup on AMD GPUs.
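To illustrate, the following is a minimal OpenCL sketch (the kernel and parameter names are our own, not code from our benchmarks) of how a scalar kernel can be rewritten with the float4 vector type so that each work-item drives all four stream cores of a thread processor:

// Scalar version: each work-item processes one float.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float alpha)
{
    int i = get_global_id(0);
    out[i] = alpha * in[i];
}

// Vectorized version: each work-item processes one float4 (128 bits),
// mapping naturally onto the four stream cores of a thread processor.
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         const float alpha)
{
    int i = get_global_id(0);
    out[i] = alpha * in[i];   // component-wise multiply across the vector
}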
Discrete GPUs from AMD house a large number of pro-
cessing cores, ranging from 800 to 1600 cores. As a result, a
humongous number of threads need to be launched in order
to keep all GPU cores fully occupied. However, to run many
threads, the number of registers used per thread has to be kept
to a minimum. That is, all the registers utilized per thread need
to be stored in a register file, and hence, the total number
of threads that can be scheduled is limited by the size of
the register file, which is a generous 256 KB on the latest
generation of AMD GPUs.
Another unique architectural feature of AMD GPUs is the
presence of a rasterizer, designed for working with two-dimensional matrices of threads and data. Hence, accessing scalar elements
stored contiguously in memory is not the most efficient access
pattern. Accessing scalar elements can be made slightly more
efficient by doing so in chunks of 128 bits due to the
presence of vector cores. Loading these chunks from image
memory, which uses the memory layout best matched to the
memory hardware on AMD GPUs, also results in significant
improvement in performance.
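As a hypothetical sketch (the kernel and its arguments are illustrative, not from the paper's benchmarks), the standard OpenCL image API can be used to fetch such 128-bit chunks from image memory:

// Plain 2D array access: unnormalized coordinates, no filtering.
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                      CLK_ADDRESS_CLAMP_TO_EDGE |
                      CLK_FILTER_NEAREST;

// Each work-item reads one 128-bit texel (a float4) from image
// memory, scales it, and writes it to a linear output buffer.
__kernel void scale_image(__read_only image2d_t src,
                          __global float4 *dst,
                          const int width,      // width in float4 texels
                          const float alpha)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float4 v = read_imagef(src, smp, (int2)(x, y));
    dst[y * width + x] = alpha * v;
}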
B. PCIe Bottlenecks with Discrete GPUs
In Fig. 2, we demonstrate the cause of PCIe bottlenecks
with a discrete GPU. As shown, the x86 host CPU can
access the system memory as well as initiate functions on
the GPU. However, because the GPU resides on PCIe, a
DMA is required to transfer data from the system memory
of the CPU to device memory of the GPU to perform any
useful work. Although the GPU can execute hundreds of
billions of floating-point operations per second, current PCIe interconnects can transfer only about a gigaword per second [7]. Due to this limitation, it behooves the GPU
application programmer to ensure high data reuse on the GPU
to be able to successfully amortize the cost of slow PCIe
transfers, and in turn, achieve substantial performance benefits.
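A minimal host-side sketch of how such a transfer can be timed with standard OpenCL event profiling follows (the 64 MB payload and all names are our illustrative choices; error handling is omitted for brevity):

/* Sketch: time a host-to-device copy with OpenCL event profiling. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    const size_t bytes = 64UL << 20;                 /* 64 MB payload */
    float *host = (float *)malloc(bytes);
    memset(host, 0, bytes);                          /* fault pages in */

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    /* Profiling must be enabled on the queue to timestamp the copy. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, &ev);

    cl_ulong t0, t1;                                 /* nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t1, &t1, NULL);
    printf("host-to-device: %.2f GB/s\n", bytes / (double)(t1 - t0));

    clReleaseMemObject(buf); clReleaseCommandQueue(q);
    clReleaseContext(ctx);   free(host);
    return 0;
}

On a discrete GPU, this measurement exercises the PCIe link; on the APU, the same call instead exercises the on-die path between the memory partitions.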
Fig. 2. Architectural Layout of a Discrete GPU. (Figure: x86 CPU cores with system (host) memory on one side; SIMD engines (~500 Gflop/s) with their thread processors and device memory on the other; the two connected by DMA over PCIe at ~1 Gword/s.)
However, ensuring high data reuse in all applications may
not be possible. For example, there might be applications for
which the execution time on a GPU is less than the time it
takes to get data onto the GPU or applications whose execution
profiles consist of iterating over DMA transfers and GPU
execution. For such applications, discrete GPUs may not be an appropriate means of accelerating performance.
Emerging architectures like AMD Fusion seek to address
these issues by “fusing” CPU and GPU functionality onto a single silicon die, thereby eliminating PCIe accesses to the GPU.
C. AMD Fusion Architecture
At the most basic level, the Fusion architecture combines
general-purpose scalar and vector processor cores onto a
single silicon die, thereby forming a heterogeneous computing
processor. It aims to provide the “best of both worlds” scenario
in the sense that scalar workloads, like word processing and
web browsing, use the x86 cores whereas vector workloads,
like parallel data processing, use the GPU cores.
Fig. 3 depicts a block diagram of this novel architecture.
The key aspect to note is that the x86 CPU cores and the
vector (SIMD) engines are attached to the system memory
via the same high-speed bus and memory controller. This
architectural artifact allows the AMD Fusion architecture to
alleviate the fundamental PCIe constraint that has traditionally
limited performance on a discrete GPU. Apart from the
processing cores, Fusion also consists of the following system
elements: memory controller, I/O controller, video decoder,
display output, and bus interfaces, all on the same die.
Although Fusion’s x86 cores and SIMD engines share
a common bus to the system memory, the first-generation
implementation of Fusion divides system memory into two
parts — one that is visible to and managed by the operating
system running on x86 cores and one that is managed by
the software running on the SIMD engines. Therefore, even
on the Fusion architecture, data has to be moved from the
operating system’s portion of system memory to the portion that is visible to the SIMD engines. However, unlike discrete
GPUs, where these data transfers from system memory to
device memory hit PCIe, the data transfers on Fusion are
expected to amount to a memcpy, as logically captured in
Fig. 4. Moreover, AMD currently provides high-speed block
transfer engines that move data between the x86 and SIMD
memory partitions. Therefore, the Fusion architecture holds
the promise of improving performance for all applications that
were previously bottlenecked by PCIe transfers. Future APU
architectures are expected to have these memories seamlessly
merged [7], which means that there will not be a need to
transfer data to and from the GPU memory at all.
Programming on Fusion is facilitated by the emerging
OpenCL standard. Therefore, existing applications written in
OpenCL for the discrete CPU and GPU combination can be
run without modification on the fused CPU+GPU of Fusion.
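For instance, the following host-side sketch (our illustration, using only standard OpenCL API calls) locates the GPU device; on an APU, the fused SIMD engines enumerate as an ordinary OpenCL GPU device, so this code runs unchanged on both platforms:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    char name[128];

    clGetPlatformIDs(1, &plat, NULL);
    /* On an APU, the fused SIMD engines still appear as a GPU device,
     * so unmodified OpenCL host code discovers them the same way. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof name, name, NULL);
    printf("GPU device: %s\n", name);
    return 0;
}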
Fig. 3. Architectural Layout of the AMD Fusion Architecture. (Figure: the x86 CPU cores and the SIMD engines, each with their thread processors, attach through a shared high-performance bus and memory controller to system memory; a unified video decoder and platform interfaces reside on the same die.)

III. REVISITING AMDAHL’S LAW

We first briefly review Amdahl’s Law [8], followed by a theoretical discussion of Hill and Marty’s work on applying it to symmetric and asymmetric multi-core chips [9]. We then
re-visit Amdahl’s Law, specifically for accelerators, and show
that fused asymmetric CPU+GPU cores for an APU enable
more parallelism in the code than discrete GPUs or traditional
multi-core symmetric processors.
The speedup of parallel applications on multi-processor
architectures is limited by Amdahl’s Law, which implies
that the speedup obtained by implementing an application in
parallel is dependent upon the fraction of the workload that
can be parallelized [8]. Hence, the speedup, S, for a parallel
application is given by (1).
S = \frac{1}{s + p/N} \qquad (1)
where p = parallel fraction of the application
s = serial fraction of the application, i.e., (1− p)
and N = number of processors
Amdahl’s law holds true in the ideal scenario for any multi-
processor system if we assume the workload to be constant,
i.e., strong scaling. This also makes the assumption that all the
processors have the same overall computational capabilities.
Fig. 4. Description of Data Transfers. (a) Discrete GPU: a PCIe transfer between system memory on the host and device memory on the GPU. (b) AMD Fusion (first generation): a memcpy within system memory, between the x86 partition and the SIMD-engine partition.
If N → ∞, then

S = \frac{1}{s} \qquad (2)
Informally, this means that even when the serial fraction of
the work is small, the maximum speedup obtainable from an
infinite number of parallel processors is limited by 1/s.
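For example, if 95% of an application parallelizes perfectly (s = 0.05), the speedup can never exceed 1/0.05 = 20, no matter how many processors are added.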
Hill and Marty [9] first categorize multi-core chips into three
groups based on how the on-chip unit resources, or base core
equivalents (BCEs), are combined to form larger processing
cores. Specifically, they classify chips into symmetric, asym-
metric, and dynamic multi-core chips and then theoretically
analyze the attainable speedups on each platform.
Symmetric chips are the traditional multi-core chips, where
every processor has the same computational capability. Equa-
tion (1) can be directly applied to symmetric chips with the
implication that it is critical for the programmer to extract
parallelism from their code.
Dynamic chips are idealistic chips, where the cores can
be dynamically combined to boost the performance of the
serial fraction of the program, thereby providing maximum
efficiency even if the code has a fairly large serial fraction. We
do not study dynamic chips in this paper. However, we will
revisit them in the future to study the next-generation AMD A-
Series APUs, which promise to improve power efficiency by
dynamically turning on and off the CPU and GPU resources
depending on the application load (AMD Power Gating) [10].
On the other hand, asymmetric chips are those that have
one large complex core for sequential programming and sev-
eral other simpler cores that help the larger core in parallel
processing. Hill and Marty show that asymmetric multi-cores
offer more potential speedup than their symmetric counterparts
even for lower values of p [9]. For example, they show that
for p = 0.975 and N = 256, the best asymmetric speedup is
125.0, whereas the best symmetric speedup is 51.2. However,
they make an idealistic assumption that the parallel fraction of
the program utilizes all the available cores completely. This is
only possible if there is a perfect co-scheduling mechanism
that enables complete utilization of the on-chip resources.
Nevertheless, it is evident that asymmetric multi-core chips
are more efficient than the symmetric ones [9].
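To make these numbers concrete, the following sketch evaluates Hill and Marty's two speedup models under their assumption that a core built from r base core equivalents (BCEs) has performance perf(r) = √r; sweeping r reproduces the 51.2 and 125.0 maxima quoted above (the code is our reconstruction of their formulas [9]):

/* Hill and Marty's speedup models [9], with perf(r) = sqrt(r).
 * For f = 0.975 and n = 256 BCEs, the maxima come out to
 * 51.2 (symmetric, at r = 7) and 125.0 (asymmetric, at r = 64). */
#include <math.h>
#include <stdio.h>

static double perf(double r) { return sqrt(r); }

/* Symmetric: n/r identical cores of r BCEs each. */
static double sym(double f, double n, double r)
{
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

/* Asymmetric: one big core of r BCEs plus (n - r) single-BCE cores. */
static double asym(double f, double n, double r)
{
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
}

int main(void)
{
    const double f = 0.975, n = 256;
    double best_s = 0, best_a = 0;
    for (int r = 1; r <= 256; r++) {
        if (sym(f, n, r) > best_s)  best_s = sym(f, n, r);
        if (asym(f, n, r) > best_a) best_a = asym(f, n, r);
    }
    printf("best symmetric:  %.1f\n", best_s);   /* 51.2  */
    printf("best asymmetric: %.1f\n", best_a);   /* 125.0 */
    return 0;
}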
We now study Amdahl’s Law for accelerator-based systems,
which can be considered to be a special type of asymmetric
multi-cores, where the accelerator cores and the serial pro-
cessor may be separated by PCIe. In general, Amdahl’s Law
ignores the overhead incurred due to parallelizing a workload.
On any multi-core processor, this overhead is largely due to
setup of parallel threads, interprocessor communication, and
thread rejoining. Therefore, the speedup obtained is always
less than the ideal. Furthermore, the overhead incurred by
parallelizing an application on an accelerator-based system,
especially the GPU, is even higher because data has to be
transferred over the slow PCIe. This fact is corroborated by
one of our micro-benchmark results, as shown in Fig. 5.
Fig. 5. Characterization of Parallel Overhead. (Figure: total execution time in ms, split into serial time, parallel time, and parallel overhead, for the fused GPU, the discrete GPU, the multicore CPU, ideal Amdahl's Law on four cores, and a single-threaded run.)

This particular micro-benchmark performs an fmad operation between each element of two float-type arrays of size 96 MB each. It is executed on three different platforms, i.e.,
a modern four-core CPU, a discrete CPU and GPU (AMD Radeon HD 5870), and a fused CPU+GPU (AMD E-Series
Zacate APU). We use OpenMP as the parallel programming
platform for the four-core processor and OpenCL for the
Radeon GPU and the Zacate APU. The figure shows the
total execution time as the sum of (i) the execution time of
the serial part, (ii) the execution time of the parallel part,
and (iii) the overhead incurred due to parallelization, i.e.,
device buffer creation, destruction, and buffer transfer. (We
have not included the constant OpenCL setup time, i.e., kernel compilation and program and platform initialization.)
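The kernel itself is deliberately trivial; a minimal OpenCL sketch of such an fmad kernel (our reconstruction, not the exact benchmark source) is:

// One fused multiply-add per element; with two 96 MB input arrays,
// the data-transfer overhead dwarfs this tiny amount of computation.
__kernel void fmad_bench(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);
    c[i] = fma(a[i], b[i], c[i]);   // c = a * b + c
}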
The execution time of the ‘parallel part’ in the case of the
discrete GPU and APU is the kernel execution time. The
single-thread implementation depicts the serial and parallel
fractions of the code, while in the case of ideal Amdahl’s Law, the parallel part is sped up four-fold (on a four-core CPU) with zero overhead. While the actual multi-core
implementation, parallelized using OpenMP, does contain par-
allel overhead, it is negligible when compared to the overhead
incurred due to parallelization on the accelerated platforms.
For the discrete GPU, however, the parallel overhead is so
significant that it is more than the sum of execution times of
the serial and parallel parts. So, while the execution time of
the parallel part on the discrete GPU is substantially better
than that on the multi-core CPU, the overhead is large enough to make the micro-benchmark ill-suited to GPU processing. This also demonstrates the bottleneck caused by
communication over PCIe.
Lastly, the APU (or fused CPU+GPU) does assist in re-
ducing the parallel overhead. However, due to the presence of
computationally less powerful SIMD cores, the execution time
of the parallel part is longer than on the discrete GPU.
To apply Amdahl’s Law to accelerator-based platforms, we
model the following two factors:
• Accelerated Parallel Fraction