On the Efficacy of a Fused CPU+GPU Processor
(or APU) for Parallel Computing
Mayank Daga, Ashwin M. Aji, and Wu-chun Feng
Dept. of Computer Science
Virginia Tech
Blacksburg, USA
{mdaga, aaji, feng}@cs.vt.edu
Abstract—The graphics processing unit (GPU) has made sig-
nificant strides as an accelerator in parallel computing. However,
because the GPU has resided out on PCIe as a discrete device,
the performance of GPU applications can be bottlenecked by
data transfers between the CPU and GPU over PCIe. Emerging
heterogeneous computing architectures that “fuse” the function-
ality of the CPU and GPU, e.g., AMD Fusion and Intel Knights
Ferry, hold the promise of addressing the PCIe bottleneck.
In this paper, we empirically characterize and analyze the
efficacy of AMD Fusion, an architecture that combines general-
purpose x86 cores and programmable accelerator cores on the
same silicon die. We characterize its performance via a set of
micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks
(e.g., reduction), and actual applications (e.g., molecular dynam-
ics). Depending on the benchmark, our results show that Fusion
produces a 1.7 to 6.0-fold improvement in the data-transfer time,
when compared to a discrete GPU. In turn, this improvement in
data-transfer performance can significantly enhance application
performance. For example, running a reduction benchmark on
AMD Fusion with its mere 80 GPU cores improves performance
by 3.5-fold over the discrete AMD Radeon HD 5870 GPU, despite the latter's 1600 more powerful GPU cores.
Keywords-AMD Fusion; graphics processing unit; GPU;
GPGPU; accelerated processing unit; APU; OpenCL; perfor-
mance evaluation; benchmarking; heterogeneous computing;
I. INTRODUCTION
The widespread adoption of compute-capable graphics pro-
cessing units (GPUs) in desktops and workstations has made
them attractive as accelerators for high-performance parallel
computing [1]. Their increased popularity has been due in
part to a unique amalgamation of performance, power, and
energy efficiency. In fact, three out of the top five fastest
supercomputers in the world, according to the Top500, employ
GPUs [2].
A closer look at the Top500 list, however, reveals that the
supercomputers powered by GPUs attain only ∼50% of their
theoretical peak performance, whereas non-GPU-powered su-
percomputers attain ∼78% of the theoretical peak [2]. This
implies that there are certain aspects of GPUs that limit
their performance for Linpack, the benchmark used to rank
supercomputers on the Top500. Since GPUs have traditionally
resided on PCI Express (PCIe), additional overhead costs are
incurred for host-to-GPU data transfers and vice versa. As a
consequence, GPU applications are oftentimes bottlenecked by
the PCIe data transfers. Thus, GPUs are not a panacea [3]–[5].
With the emergence of heterogeneous computing architec-
tures that “fuse” the functionality of the CPU and GPU onto
the same die, e.g., AMD Fusion and Intel Knights Ferry,
there is an expectation that the PCIe bottlenecks would be
addressed. In these architectures, the x86 CPU cores and the
programmable GPU cores share a common path to system
memory. Also present are high-speed block transfer engines,
which assist in data movement between the x86 and GPU
cores. Hence, data transfers never hit the system’s external
bus, thereby mitigating the adverse effects of slow PCIe.
In this paper, we present an empirical characterization and
analysis of the effectiveness of the AMD Fusion architecture.
To the best of our knowledge, this work is the first to do
so. The processor built upon the Fusion architecture is called
an accelerated processing unit or APU. The APU combines
the general-purpose x86 cores of a CPU with programmable
vector-processing engines of a GPU onto a single silicon die.
We then re-visit Amdahl’s Law for today’s era of accelerated
processors. Specifically, we show that the fused CPU+GPU
cores enable better performance than a discrete GPU and even
traditional multi-core CPU processors by reducing the parallel
overhead of PCIe data transfers.
To characterize the performance of the AMD Fusion archi-
tecture, we use four benchmarks from the Scalable HeterOge-
neous Computing (SHOC) benchmark suite [6] as well as the
OpenCL PCIe bandwidth test. Via these benchmarks, we show
that the Fusion architecture can overcome the PCIe bottleneck
associated with a discrete GPU, though not always.
For the first-generation AMD Fusion, the APU, which is a
fused combination of CPU+GPU, delivers better performance
than a discrete CPU+GPU combination when the amount of
data to be transferred between the CPU and GPU exceeds a
minimum threshold and when the amount of computation on
the GPU cores is not high enough for the discrete GPU to
amortize the PCIe data-transfer overhead.
Our empirical results indicate that the APU improves data-
transfer times by 1.7 to 6.0-fold over the discrete CPU+GPU
combination. For one particular benchmark, i.e., reduction,
the total execution time is 3.5-fold better on the Fusion APU
than on the discrete GPU despite the latter having 20 times
more GPU cores and more powerful cores at that. In turn,
the improvement in data-transfer times reduces the parallel
overhead, thus providing more parallelism to the application.
The rest of this paper is organized as follows. Section II
presents an overview of AMD GPUs and the issue of the
PCIe bottleneck in discrete CPU+GPU platforms. We then in-
troduce the AMD Fusion architecture and outline why it holds
the promise of overcoming this bottleneck. In Section III,
we re-visit Amdahl’s Law for accelerator-based processors.
In Section IV, we illustrate and discuss the results of our
experiments. Section V presents related work, followed by
conclusions and future work in Section VI.
II. BACKGROUND
Here we present an overview of the AMD GPU and discuss
the effect of the PCIe bottleneck, which oftentimes proves
to be an obstacle towards achieving better overall application
performance. We then describe the architecture of the first
generation of the accelerated processing unit (APU), i.e., the AMD
Fusion E-Series APU, and show how it holds the promise of
overcoming the PCIe bottleneck.
A. Overview of the AMD GPU
An AMD GPU follows a classic graphics design, which is highly tuned for single-precision floating-point arithmetic and for common image operations on two-dimensional matrix and image data. Fig. 1 provides an architectural overview.
Fig. 1. Overview of an AMD/ATI Stream Processor and Thread Scheduler. (Figure: an ultra-threaded dispatch processor feeds the SIMD engines; each thread processor contains stream cores, a T-Stream core, and a branch execution unit, backed by general-purpose registers, a thread queue, and a rasterizer.)
In this case, the compute unit is known as a SIMD engine
and contains several thread processors, each containing four
stream cores, along with a special-purpose core and a branch
execution unit. The special-purpose (or T-Stream) core is de-
signed to execute certain mathematical functions in hardware,
e.g., transcendentals like sin(), cos(), and tan(). Since there is only one branch execution unit for every five processing cores, any branch in the program incurs some serialization to determine the path each thread should take; divergent branches then execute in lock-step across all the cores in a compute unit. In addition, the processing cores are vector processors, which means that using vector types can produce a material speedup on AMD GPUs.
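To illustrate, the following is a minimal OpenCL sketch (the kernel and parameter names are our own, not code from our benchmarks) of how a scalar kernel can be rewritten with the float4 vector type so that each work-item drives all four stream cores of a thread processor:

// Scalar version: each work-item processes one float.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float alpha)
{
    int i = get_global_id(0);
    out[i] = alpha * in[i];
}

// Vectorized version: each work-item processes one float4 (128 bits),
// mapping naturally onto the four stream cores of a thread processor.
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         const float alpha)
{
    int i = get_global_id(0);
    out[i] = alpha * in[i];   // component-wise multiply across the vector
}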
Discrete GPUs from AMD house a large number of pro-
cessing cores, ranging from 800 to 1600 cores. As a result, a
humongous number of threads need to be launched in order
to keep all GPU cores fully occupied. However, to run many
threads, the number of registers used per thread has to be kept
to a minimum. That is, all the registers utilized per thread need
to be stored in a register file, and hence, the total number
of threads that can be scheduled is limited by the size of
the register file, which is a generous 256 KB on the latest
generation of AMD GPUs.
Another unique architectural feature of AMD GPUs is the
presence of a rasterizer, designed for working with two-dimensional matrices of threads and data. Hence, accessing scalar elements
stored contiguously in memory is not the most efficient access
pattern. Accessing scalar elements can be made slightly more
efficient by doing so in chunks of 128 bits due to the
presence of vector cores. Loading these chunks from image
memory, which uses the memory layout best matched to the
memory hardware on AMD GPUs, also results in significant
improvement in performance.
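As a hypothetical sketch (the kernel and its arguments are illustrative, not from the paper's benchmarks), the standard OpenCL image API can be used to fetch such 128-bit chunks from image memory:

// Plain 2D array access: unnormalized coordinates, no filtering.
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                      CLK_ADDRESS_CLAMP_TO_EDGE |
                      CLK_FILTER_NEAREST;

// Each work-item reads one 128-bit texel (a float4) from image
// memory, scales it, and writes it to a linear output buffer.
__kernel void scale_image(__read_only image2d_t src,
                          __global float4 *dst,
                          const int width,      // width in float4 texels
                          const float alpha)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float4 v = read_imagef(src, smp, (int2)(x, y));
    dst[y * width + x] = alpha * v;
}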
B. PCIe Bottlenecks with Discrete GPUs
In Fig. 2, we demonstrate the cause of PCIe bottlenecks
with a discrete GPU. As shown, the x86 host CPU can
access the system memory as well as initiate functions on
the GPU. However, because the GPU resides on PCIe, a
DMA is required to transfer data from the system memory
of the CPU to device memory of the GPU to perform any
useful work. Although the GPU can execute hundreds of
billions of floating-point operations per second, current PCIe interconnects can transfer only about a gigaword per second [7]. Due to this limitation, it behooves the GPU
application programmer to ensure high data reuse on the GPU
to be able to successfully amortize the cost of slow PCIe
transfers, and in turn, achieve substantial performance benefits.
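A minimal host-side sketch of how such a transfer can be timed with standard OpenCL event profiling follows (the 64 MB payload and all names are our illustrative choices; error handling is omitted for brevity):

/* Sketch: time a host-to-device copy with OpenCL event profiling. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    const size_t bytes = 64UL << 20;                 /* 64 MB payload */
    float *host = (float *)malloc(bytes);
    memset(host, 0, bytes);                          /* fault pages in */

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    /* Profiling must be enabled on the queue to timestamp the copy. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, &ev);

    cl_ulong t0, t1;                                 /* nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t1, &t1, NULL);
    printf("host-to-device: %.2f GB/s\n", bytes / (double)(t1 - t0));

    clReleaseMemObject(buf); clReleaseCommandQueue(q);
    clReleaseContext(ctx);   free(host);
    return 0;
}

On a discrete GPU, this measurement exercises the PCIe link; on the APU, the same call instead exercises the on-die path between the memory partitions.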
Fig. 2. Architectural Layout of a Discrete GPU. (Figure: x86 CPU cores with system (host) memory on one side; SIMD engines (~500 Gflop/s) with their thread processors and device memory on the other; the two connected by DMA over PCIe at ~1 Gword/s.)
However, ensuring high data reuse in all applications may
not be possible. For example, there might be applications for
which the execution time on a GPU is less than the time it
takes to get data onto the GPU or applications whose execution
profiles consist of iterating over DMA transfers and GPU
execution. For such applications, discrete GPUs may not be an appropriate means of accelerating performance.
Emerging architectures like AMD Fusion seek to address
these issues by “fusing” CPU and GPU functionality onto a single silicon die, thereby eliminating PCIe accesses to the GPU.
C. AMD Fusion Architecture
At the most basic level, the Fusion architecture combines
general-purpose scalar and vector processor cores onto a
single silicon die, thereby forming a heterogeneous computing
processor. It aims to provide the “best of both worlds” scenario
in the sense that scalar workloads, like word processing and
web browsing, use the x86 cores whereas vector workloads,
like parallel data processing, use the GPU cores.
Fig. 3 depicts a block diagram of this novel architecture.
The key aspect to note is that the x86 CPU cores and the
vector (SIMD) engines are attached to the system memory
via the same high-speed bus and memory controller. This
architectural artifact allows the AMD Fusion architecture to
alleviate the fundamental PCIe constraint that has traditionally
limited performance on a discrete GPU. Apart from the
processing cores, Fusion also consists of the following system
elements: memory controller, I/O controller, video decoder,
display output, and bus interfaces, all on the same die.
Although Fusion’s x86 cores and SIMD engines share
a common bus to the system memory, the first-generation
implementation of Fusion divides system memory into two
parts — one that is visible to and managed by the operating
system running on x86 cores and one that is managed by
the software running on the SIMD engines. Therefore, even
on the Fusion architecture, data has to be moved from the
operating system’s portion of system memory to the portion that is visible to the SIMD engines. However, unlike discrete
GPUs, where these data transfers from system memory to
device memory hit PCIe, the data transfers on Fusion are
expected to amount to a memcpy, as logically captured in
Fig. 4. Moreover, AMD currently provides high-speed block
transfer engines that move data between the x86 and SIMD
memory partitions. Therefore, the Fusion architecture holds
the promise of improving performance for all applications that
were previously bottlenecked by PCIe transfers. Future APU
architectures are expected to have these memories seamlessly
merged [7], which means that there will not be a need to
transfer data to and from the GPU memory at all.
Programming on Fusion is facilitated by the emerging
OpenCL standard. Therefore, existing applications written in
OpenCL for the discrete CPU and GPU combination can be
run without modification on the fused CPU+GPU of Fusion.
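For instance, the following host-side sketch (our illustration, using only standard OpenCL API calls) locates the GPU device; on an APU, the fused SIMD engines enumerate as an ordinary OpenCL GPU device, so this code runs unchanged on both platforms:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    char name[128];

    clGetPlatformIDs(1, &plat, NULL);
    /* On an APU, the fused SIMD engines still appear as a GPU device,
     * so unmodified OpenCL host code discovers them the same way. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof name, name, NULL);
    printf("GPU device: %s\n", name);
    return 0;
}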
Fig. 3. Architectural Layout of the AMD Fusion Architecture. (Figure: the x86 CPU cores and the SIMD engines, each with their thread processors, attach through a shared high-performance bus and memory controller to system memory; a unified video decoder and platform interfaces reside on the same die.)

III. REVISITING AMDAHL’S LAW

We first briefly review Amdahl’s Law [8], followed by a theoretical discussion of Hill and Marty’s work on applying it to symmetric and asymmetric multi-core chips [9]. We then
re-visit Amdahl’s Law, specifically for accelerators, and show
that fused asymmetric CPU+GPU cores for an APU enable
more parallelism in the code than discrete GPUs or traditional
multi-core symmetric processors.
The speedup of parallel applications on multi-processor
architectures is limited by Amdahl’s Law, which implies
that the speedup obtained by implementing an application in
parallel is dependent upon the fraction of the workload that
can be parallelized [8]. Hence, the speedup, S, for a parallel
application is given by (1).
S = \frac{1}{s + p/N} \qquad (1)
where p = parallel fraction of the application
s = serial fraction of the application, i.e., (1− p)
and N = number of processors
Amdahl’s law holds true in the ideal scenario for any multi-
processor system if we assume the workload to be constant,
i.e., strong scaling. This also makes the assumption that all the
processors have the same overall computational capabilities.
Fig. 4. Description of Data Transfers. (a) Discrete GPU: a PCIe transfer between system memory on the host and device memory on the GPU. (b) AMD Fusion (first generation): a memcpy within system memory, between the x86 partition and the SIMD-engine partition.
If N → ∞, then

S = \frac{1}{s} \qquad (2)
Informally, this means that even when the serial fraction of
the work is small, the maximum speedup obtainable from an
infinite number of parallel processors is limited by 1/s.
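For example, if 95% of an application parallelizes perfectly (s = 0.05), the speedup can never exceed 1/0.05 = 20, no matter how many processors are added.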
Hill and Marty [9] first categorize multi-core chips into three
groups based on how the on-chip unit resources, or base core
equivalents (BCEs), are combined to form larger processing
cores. Specifically, they classify chips into symmetric, asym-
metric, and dynamic multi-core chips and then theoretically
analyze the attainable speedups on each platform.
Symmetric chips are the traditional multi-core chips, where
every processor has the same computational capability. Equa-
tion (1) can be directly applied to symmetric chips with the
implication that it is critical for the programmer to extract
parallelism from their code.
Dynamic chips are idealistic chips, where the cores can
be dynamically combined to boost the performance of the
serial fraction of the program, thereby providing maximum
efficiency even if the code has a fairly large serial fraction. We
do not study dynamic chips in this paper. However, we will
revisit them in the future to study the next-generation AMD A-
Series APUs, which promise to improve power efficiency by
dynamically turning on and off the CPU and GPU resources
depending on the application load (AMD Power Gating) [10].
On the other hand, asymmetric chips are those that have
one large complex core for sequential programming and sev-
eral other simpler cores that help the larger core in parallel
processing. Hill and Marty show that asymmetric multi-cores
offer more potential speedup than their symmetric counterparts
even for lower values of p [9]. For example, they show that
for p = 0.975 and N = 256, the best asymmetric speedup is
125.0, whereas the best symmetric speedup is 51.2. However,
they make an idealistic assumption that the parallel fraction of
the program utilizes all the available cores completely. This is
only possible if there is a perfect co-scheduling mechanism
that enables complete utilization of the on-chip resources.
Nevertheless, it is evident that asymmetric multi-core chips
are more efficient than the symmetric ones [9].
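To make these numbers concrete, the following sketch evaluates Hill and Marty's two speedup models under their assumption that a core built from r base core equivalents (BCEs) has performance perf(r) = √r; sweeping r reproduces the 51.2 and 125.0 maxima quoted above (the code is our reconstruction of their formulas [9]):

/* Hill and Marty's speedup models [9], with perf(r) = sqrt(r).
 * For f = 0.975 and n = 256 BCEs, the maxima come out to
 * 51.2 (symmetric, at r = 7) and 125.0 (asymmetric, at r = 64). */
#include <math.h>
#include <stdio.h>

static double perf(double r) { return sqrt(r); }

/* Symmetric: n/r identical cores of r BCEs each. */
static double sym(double f, double n, double r)
{
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

/* Asymmetric: one big core of r BCEs plus (n - r) single-BCE cores. */
static double asym(double f, double n, double r)
{
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
}

int main(void)
{
    const double f = 0.975, n = 256;
    double best_s = 0, best_a = 0;
    for (int r = 1; r <= 256; r++) {
        if (sym(f, n, r) > best_s)  best_s = sym(f, n, r);
        if (asym(f, n, r) > best_a) best_a = asym(f, n, r);
    }
    printf("best symmetric:  %.1f\n", best_s);   /* 51.2  */
    printf("best asymmetric: %.1f\n", best_a);   /* 125.0 */
    return 0;
}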
We now study Amdahl’s Law for accelerator-based systems,
which can be considered to be a special type of asymmetric
multi-cores, where the accelerator cores and the serial pro-
cessor may be separated by PCIe. In general, Amdahl’s Law
ignores the overhead incurred due to parallelizing a workload.
On any multi-core processor, this overhead is largely due to
setup of parallel threads, interprocessor communication, and
thread rejoining. Therefore, the speedup obtained is always
less than the ideal. Furthermore, the overhead incurred by
parallelizing an application on an accelerator-based system,
especially the GPU, is even higher because data has to be
transferred over the slow PCIe. This fact is corroborated by
one of our micro-benchmark results, as shown in Fig. 5.
Fig. 5. Characterization of Parallel Overhead. (Figure: total execution time in ms, split into serial time, parallel time, and parallel overhead, for the fused GPU, the discrete GPU, the multicore CPU, ideal Amdahl's Law on four cores, and a single-threaded run.)

This particular micro-benchmark performs an fmad operation between each element of two float-type arrays of size 96 MB each. It is executed on three different platforms, i.e.,
a modern four-core CPU, a discrete CPU and GPU (AMD Radeon HD 5870), and a fused CPU+GPU (AMD E-Series
Zacate APU). We use OpenMP as the parallel programming
platform for the four-core processor and OpenCL for the
Radeon GPU and the Zacate APU. The figure shows the
total execution time as the sum of (i) the execution time of
the serial part, (ii) the execution time of the parallel part,
and (iii) the overhead incurred due to parallelization, i.e.,
device buffer creation, destruction, and buffer transfer. (We
have not included the constant OpenCL setup time, i.e., kernel compilation and program and platform initialization.)
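The kernel itself is deliberately trivial; a minimal OpenCL sketch of such an fmad kernel (our reconstruction, not the exact benchmark source) is:

// One fused multiply-add per element; with two 96 MB input arrays,
// the data-transfer overhead dwarfs this tiny amount of computation.
__kernel void fmad_bench(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);
    c[i] = fma(a[i], b[i], c[i]);   // c = a * b + c
}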
The execution time of the ‘parallel part’ in the case of the
discrete GPU and APU is the kernel execution time. The
single-thread implementation depicts the serial and parallel
fractions of the code, while in the case of ideal Amdahl’s Law, the parallel part is sped up four-fold (on a four-core CPU) with zero overhead. While the actual multi-core
implementation, parallelized using OpenMP, does contain par-
allel overhead, it is negligible when compared to the overhead
incurred due to parallelization on the accelerated platforms.
For the discrete GPU, however, the parallel overhead is so
significant that it is more than the sum of execution times of
the serial and parallel parts. So, while the execution time of
the parallel part on the discrete GPU is substantially better
than that on the multi-core CPU, the overhead is large enough to make the micro-benchmark ill-suited to GPU processing. This also demonstrates the bottleneck caused by
communication over PCIe.
Lastly, the APU (or fused CPU+GPU) does assist in re-
ducing the parallel overhead. However, due to the presence of
computationally less powerful SIMD cores, the execution time
of the parallel part is longer than on the discrete GPU.
To apply Amdahl’s Law to accelerator-based platforms, we
model the following two factors:
• Accelerated Parallel Fraction