An Analytical Model for a GPU Architecture with
Memory-level and Thread-level Parallelism Awareness
Sunpyo Hong
Electrical and Computer Engineering
Georgia Institute of Technology
shong9@gatech.edu
Hyesoon Kim
School of Computer Science
Georgia Institute of Technology
hyesoon@cc.gatech.edu
ABSTRACT
GPU architectures are increasingly important in the multi-core era
due to their high number of parallel processors. Programming thou-
sands of massively parallel threads is a big challenge for software
engineers, but understanding the performance bottlenecks of those
parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exhaustively exploring the design space without fully understanding the performance characteristics of their applications.
To provide insights into the performance bottlenecks of parallel
applications on GPU architectures, we propose a simple analytical
model that estimates the execution time of massively parallel pro-
grams. The key component of our model is estimating the number
of parallel memory requests (we call this the memory warp paral-
lelism) by considering the number of running threads and memory
bandwidth. Based on the degree of memory warp parallelism, the
model estimates the cost of memory requests, thereby estimating
the overall execution time of a program. Comparisons between
the outcome of the model and the actual execution time in several
GPUs show that the geometric mean of absolute error of our model
on micro-benchmarks is 5.4% and on GPU computing applications
is 13.3%. All the applications are written in the CUDA program-
ming language.
Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures
; C.4 [Performance of Systems]: Modeling techniques
; C.5.3 [Computer System Implementation]: Microcomputers
General Terms
Measurement, Performance
Keywords
Analytical model, CUDA, GPU architecture, Memory level paral-
lelism, Warp level parallelism, Performance estimation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISCA’09, June 20–24, 2009, Austin, Texas, USA.
Copyright 2009 ACM 978-1-60558-526-0/09/06 ...$5.00.
1. INTRODUCTION
The increasing number of parallel processors in GPUs gives them considerably higher peak computing power than CPUs. For example,
NVIDIA’s GTX280 GPUs [3] provide 933 Gflop/s with 240 cores,
while Intel’s Core2Quad processors [2] deliver only 100 Gflop/s.
Intel’s next generation of graphics processors will support more
than 900 Gflop/s [26]. AMD/ATI’s latest GPU (HD4870) provides
1.2 Tflop/s [1]. However, even though the hardware offers this much computing power, writing parallel programs that take full advantage of it is still a big challenge.
Recently, new programming languages such as Brook+ [5], CUDA [22], and OpenCL [16] have been introduced to reduce the programmers' burden of writing parallel applications for GPUs.
However, even with these newly developed programming languages,
programmers still need to spend enormous time and effort to optimize their applications to achieve better performance [24]. Although the GPGPU community [11] provides general guidelines for optimizing applications using CUDA, clearly understanding the features of the underlying architecture and the associated performance bottlenecks in their applications is still left to programmers. As a result, programmers often have to try every combination of configurations to find the best-performing one [24].
To provide insight into performance bottlenecks in massively
parallel architectures, especially GPU architectures, we propose a
simple analytical model. The model can be used statically with-
out executing an application. The basic intuition of our analytical
model is that estimating the cost of memory operations is the key
component of estimating the performance of parallel GPU appli-
cations. The execution time of an application is dominated by the
latency of memory instructions, but the latency of each memory op-
eration can be hidden by executing multiple memory requests con-
currently. By using the number of concurrently running threads and
the memory bandwidth consumption, we estimate how many mem-
ory requests can be executed concurrently, which we call memory
warp1 parallelism (MWP). We also introduce computation warp
parallelism (CWP). CWP represents how much computation can
be done by other warps while one warp is waiting for memory val-
ues. CWP is similar to a metric, arithmetic intensity2[23] in the
GPGPU community. Using both MWP and CWP, we estimate ef-
fective costs of memory requests, thereby estimating the overall
execution time of a program.
We evaluate our analytical model based on the CUDA [20, 22]
1 A warp is a batch of threads that are internally executed together by the hardware. Section 2 describes warps.
2 Arithmetic intensity is defined as math operations per memory operation.
programming language, which is C with extensions for parallel
threads. We compare the results of our analytical model with the
actual execution time on several GPUs. Our results show that the
geometric mean of absolute error of our model on micro-benchmarks
is 5.4% and on the Merge benchmarks [17]3 is 13.3%.
The contributions of our work are as follows:
1. To the best of our knowledge, we propose the first analytical
model for the GPU architecture. This can be easily extended
to other multithreaded architectures as well.
2. We propose two new metrics, MWP and CWP, to represent
the degree of warp-level parallelism; these metrics provide key insights for identifying performance bottlenecks.
2. BACKGROUND AND MOTIVATION
We provide a brief background on the GPU architecture and pro-
gramming model that we modeled. Our analytical model is based
on the CUDA programming model and the NVIDIA Tesla archi-
tecture [3, 8, 20] used in the GeForce 8-series GPUs.
2.1 Background on the CUDA Programming
Model
The CUDA programming model is similar in style to a single-
program multiple-data (SPMD) software model. The GPU is treated
as a coprocessor that executes data-parallel kernel functions.
CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three-level hierarchy. A grid is a set of thread blocks that
execute a kernel function. Each grid consists of blocks of threads.
Each block is composed of hundreds of threads. Threads within one
block can share data using shared memory and can be synchronized
at a barrier. All threads within a block are executed concurrently
on a multithreaded architecture.
The programmer specifies the number of threads per block, and
the number of blocks per grid. A thread in the CUDA program-
ming language is much lighter weight than a thread in traditional
operating systems. A thread in CUDA typically processes one data
element at a time. The CUDA programming model has two shared
read-write memory spaces, the shared memory space and the global
memory space. The shared memory is local to a block and the
global memory space is accessible by all blocks. CUDA also pro-
vides two read-only memory spaces, the constant space and the
texture space, which reside in external DRAM, and are accessed
via read-only caches.
2.2 Background on the GPU Architecture
Figure 1 shows an overview of the GPU architecture. The GPU
architecture consists of a scalable number of streaming multipro-
cessors (SMs), each containing eight streaming processor (SP) cores,
two special function units (SFUs), a multithreaded instruction fetch
and issue unit, a read-only constant cache, and a 16KB read/write
shared memory [8].
The SM executes a batch of 32 threads together called a warp.
Executing a warp instruction applies the instruction to 32 threads,
similar to executing a SIMD instruction like an SSE instruction [14]
in x86. However, unlike SIMD instructions, the concept of a warp is not exposed to programmers; rather, programmers write a program for one thread and then specify the number of parallel threads in a block and the number of blocks in a kernel grid. The Tesla architecture forms a warp from a batch of 32 threads [13, 9], and in the rest of the paper we also use a warp to mean a batch of 32 threads.
3 The Merge benchmarks consist of several media processing applications.
Figure 1: An overview of the GPU architecture
All the threads in one block are executed on one SM together.
One SM can also have multiple concurrently running blocks. The
number of blocks that are running on one SM is determined by the
resource requirements of each block such as the number of registers
and shared memory usage. The blocks that are running on one SM
at a given time are called active blocks in this paper. Since one
block typically has several warps (the number of warps is the same
as the number of threads in a block divided by 32), the total number
of active warps per SM is equal to the number of warps per block
times the number of active blocks.
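The active-warp arithmetic above can be sketched as a small calculation. The resource limits below (registers and shared memory per SM, maximum resident blocks) are illustrative placeholders rather than exact Tesla values:

```python
def active_warps_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                        regs_per_sm=8192, smem_per_sm=16384, max_blocks=8):
    """Estimate active warps per SM from per-block resource needs.

    The default limits are illustrative; real devices differ.
    """
    warps_per_block = (threads_per_block + 31) // 32
    # Each resource independently caps the number of concurrent blocks.
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    active_blocks = min(by_regs, by_smem, max_blocks)
    # Active warps = warps per block x active blocks, as described above.
    return warps_per_block * active_blocks

# Example: 128 threads/block, 10 registers/thread, 4KB shared memory/block.
# Registers allow 6 blocks, shared memory allows 4 -> 4 blocks, 16 warps.
```

As the example shows, whichever resource is exhausted first determines the number of active blocks, and hence the active-warp count.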
The shared memory is implemented within each SM as SRAM, and the global memory is part of the off-chip DRAM. The shared memory has very low access latency (almost the same as that of a register) and high bandwidth. However, since a warp of 32 threads accesses the shared memory together, a bank conflict within a warp makes the shared memory access take multiple cycles.
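The bank-conflict serialization just described can be sketched as follows; the 16-bank count and the address-to-bank mapping are simplifying assumptions in the spirit of the G80 design, not an exact hardware model:

```python
from collections import Counter

def shared_access_cycles(addresses, num_banks=16, word_bytes=4):
    """Cycles to service one warp's shared-memory access.

    Requests hitting the same bank are serialized, so the most-loaded
    bank determines the number of passes. The bank mapping
    (address // word_bytes) % num_banks is a simplification.
    """
    banks = Counter((addr // word_bytes) % num_banks for addr in addresses)
    return max(banks.values())

# Conflict-free: 16 consecutive words map to 16 distinct banks -> 1 cycle.
# A stride of two words hits only the even banks, twice each -> 2 cycles.
```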
2.3 Coalesced and Uncoalesced Memory Accesses
The SM processor executes one warp at a time and schedules warps in a time-sharing fashion. The processor has enough
functional units and register read/write ports to execute 32 threads
(i.e. one warp) together. Since an SM has only 8 functional units,
executing 32 threads takes 4 SM processor cycles for computation
instructions.4
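The issue-rate arithmetic is a direct consequence of the warp size and the number of SP cores per SM:

```python
def cycles_per_warp_instruction(warp_size=32, num_sp_cores=8):
    # With 8 functional units, a 32-thread warp instruction is issued
    # over 32 / 8 = 4 SM processor cycles.
    return warp_size // num_sp_cores
```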
When the SM processor executes a memory instruction, it gen-
erates memory requests and switches to another warp until all the
memory values in the warp are ready. Ideally, all the memory ac-
cesses within a warp can be combined into one memory transac-
tion. Unfortunately, that depends on the memory access pattern
within a warp. If the memory addresses are sequential, all of the
memory requests within a warp can be coalesced into a single mem-
ory transaction. Otherwise, each memory address will generate a
different transaction. Figure 2 illustrates two cases. The CUDA
manual [22] provides detailed algorithms to identify types of co-
alesced/uncoalesced memory accesses. If memory requests in a
warp are uncoalesced, the warp cannot be executed until all mem-
ory transactions from the same warp are serviced, which takes sig-
nificantly longer than waiting for only one memory request (coa-
lesced case).
4 In this paper, a computation instruction means a non-memory instruction.
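A rough sketch of the coalescing behavior described above: addresses that fall in the same aligned segment merge into one transaction. The 128-byte segment size and the simple set-based merging are assumptions for illustration; the CUDA manual's actual rules (alignment requirements, per-half-warp checks on early hardware) are stricter:

```python
def memory_transactions(addresses, segment_bytes=128):
    """Estimate memory transactions for one warp's global-memory access.

    Simplified coalescing: addresses in the same aligned segment merge
    into one transaction. Real CUDA coalescing rules are stricter.
    """
    segments = {addr // segment_bytes for addr in addresses}
    return len(segments)

# Sequential 4-byte accesses from 32 threads fit one 128B segment:
# memory_transactions([i * 4 for i in range(32)]) -> 1 transaction.
# A 128-byte stride leaves every access in its own segment:
# memory_transactions([i * 128 for i in range(32)]) -> 32 transactions.
```

The two cases mirror Figure 2: the coalesced pattern costs one transaction, while the uncoalesced pattern costs one transaction per thread.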
Figure 2: Memory requests from a single warp. (a) coalesced memory access (b) uncoalesced memory access
2.4 Motivating Example
To motivate the importance of a static performance analysis on
the GPU architecture, we show an example of performance differ-
ences from three different versions of the same algorithm in Fig-
ure 3. The SVM benchmark is a kernel extracted from a face clas-
sification algorithm [28]. The performance of the applications is measured on an NVIDIA QuadroFX5600 [4]. There are three different
optimized versions of the same SVM algorithm: Naive, Constant,
and Constant+Optimized. Naive uses only the global memory,
Constant uses the cached read-only constant memory5, and Con-
stant+Optimized also optimizes memory accesses6 on top of using
the constant memory. Figure 3 shows the execution time when the
number of threads per block is varied. In this example, the number
of blocks is fixed so the number of threads per block determines the
total number of threads in a program. The performance improve-
ment of Constant+Optimized and that of Constant over the Naive
implementation are 24.36x and 1.79x, respectively. Even though
the performance of each version might be affected by the number
of threads, once the number of threads exceeds 64, the performance
does not vary significantly.
Figure 3: Optimization impacts on SVM
Figure 4 shows SM processor occupancy [22] for the three cases.
The SM processor occupancy indicates resource utilization and has been widely used to optimize GPU computing applications. It
is calculated based on the resource requirements for a given pro-
gram. Typically, high occupancy (the max value is 1) is better
for performance since many actively running threads would more
likely hide the DRAM memory access latency. However, SM pro-
cessor occupancy does not sufficiently estimate the performance
5 The benefits of using the constant memory are (1) it has an on-chip cache per SM and (2) using the constant memory can reduce register usage, which might increase the number of running blocks in one SM.
6 The programmer optimized the code to have coalesced memory accesses instead of uncoalesced memory accesses.
Figure 4: Occupancy values of SVM
improvement as shown in Figure 4. First, when the number of
threads per block is less than 64, all three cases show the same
occupancy values even though the performances of the three cases are different. Second, even though SM processor occupancy is improved,
for some cases, there is no performance improvement. For exam-
ple, the performance of Constant is not improved at all even though
the SM processor occupancy is increased from 0.35 to 1. Hence, we
need other metrics to differentiate the three cases and to understand
what the critical component of performance is.
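Occupancy itself is only the ratio of active warps to the hardware maximum, which is one reason it cannot distinguish the three SVM versions. A minimal sketch, assuming the 24 resident warps per SM of G80-class hardware such as the QuadroFX5600:

```python
def sm_occupancy(active_warps, max_warps_per_sm=24):
    # Occupancy = active warps / maximum resident warps per SM.
    # It says nothing about how those warps' memory requests behave,
    # e.g. whether they are coalesced or which memory space they use.
    return active_warps / max_warps_per_sm

# Naive, Constant, and Constant+Optimized can all reach occupancy 1
# with very different execution times, as Figures 3 and 4 show.
```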
3. ANALYTICAL MODEL
3.1 Introduction to MWP and CWP
The GPU architecture is a multithreaded architecture. Each SM
can execute multiple warps in a time-sharing fashion while one or
more warps are waiting for memory values. As a result, the ex-
ecution cost of warps that are executed concurrently can be hid-
den. The key component of our analytical model is finding out how
many memory requests can be serviced and how many warps can
be executed together while one warp is waiting for memory values.
To represent the degree of warp parallelism, we introduce two
metrics, MWP (Memory Warp Parallelism) and CWP (Computa-
tion Warp Parallelism). MWP represents the maximum number of
warps per SM that can access the memory simultaneously during
the time period from right after the SM processor executes a mem-
ory instruction from one warp (therefore, memory requests are just
sent to the memory system) until all the memory requests from the
same warp are serviced (therefore, the processor can execute the
next instruction from that warp). The warp that is waiting for mem-
ory values is called a memory warp in this paper. The time period
from right after one warp sent memory requests until all the mem-
ory requests from the same warp are serviced is called one memory
warp waiting period. CWP represents the number of warps that the
SM processor can execute during one memory warp waiting pe-
riod plus one. The one is added to include the warp itself, which is waiting for memory values. (This means that CWP is always
greater than or equal to 1.)
MWP is related to how much memory-level parallelism exists in the system.
MWP is determined by the memory bandwidth, memory bank par-
allelism and the number of running warps per SM. MWP plays a
very important role in our analytical model. When MWP is higher
than 1, the cost of memory access cycles from (MWP-1) number
of warps is all hidden, since they are all accessing the memory sys-
tem together. The detailed algorithm of calculating MWP will be
described in Section 3.3.1.
CWP is related to the program characteristics. It is similar to arithmetic intensity, but unlike arithmetic intensity, a higher CWP means less computation per memory access. CWP also considers timing information, which arithmetic intensity does not. CWP is mainly used to decide whether the total execution time is dominated by computation cost or memory access
execution time is dominated by computation cost or memory access
cost. When CWP is greater than MWP, the execution cost is domi-
nated by memory access cost. However, when MWP is greater than
CWP, the execution cost is dominated by computation cost. How
to calculate CWP will be described in Section 3.3.2.
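The two metrics and the dominance rule above can be sketched as follows. The CWP expression is one plausible reading of the definition (warps executable in one memory waiting period, plus one, capped by the number of active warps); the model's exact equations appear in Section 3.3:

```python
def cwp(mem_waiting_cycles, comp_cycles, n_active_warps):
    # Warps the SM can execute during one memory waiting period, plus
    # the waiting warp itself, capped by the number of active warps.
    return min(mem_waiting_cycles / comp_cycles + 1, n_active_warps)

def dominant_cost(mwp, cwp_value):
    # When CWP > MWP the execution cost is dominated by memory access;
    # otherwise by computation (ties treated as computation here,
    # a simplification of the model's full case analysis).
    return "memory" if cwp_value > mwp else "computation"

# Case 1 of Figure 5: one memory waiting period is ~3 computation
# periods and 8 warps are active, so CWP = 4; with MWP = 2, the
# execution cost is memory dominated.
```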
3.2 The Cost of Executing Multiple Warps in the GPU Architecture
To explain how executing multiple warps in each SM affects
the total execution time, we will illustrate several scenarios in Fig-
ures 5, 6, 7 and 8. A computation period indicates the period when
instructions from one warp are executed on the SM processor. A
memory waiting period indicates the period when memory requests
are being serviced. The numbers inside the computation period
boxes and memory waiting period boxes in Figures 5, 6, 7 and 8
indicate a warp identification number.
3.2.1 CWP is Greater than MWP
Figure 5: Total execution time when CWP is greater than MWP: (a) 8 warps (b) 4 warps
For Case 1 in Figure 5a, we assume that all the computation pe-
riods and memory waiting periods are from different warps. The
system can service two memory warps simultaneously. Since one
computation period is roughly one third of one memory waiting period, the processor can finish three warps' computation periods during one memory waiting period (i.e., MWP is 2 and
CWP is 4 for this case.) As a result, the 6 computation periods are
completely overlapped with other memory waiting periods. Hence,
only 2 computations and 4 memory waiting periods contribute to
the total execution cycles.
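The Case 1 arithmetic can be checked with illustrative cycle counts (one memory waiting period equal to three computation periods, as in the figure); this is a sketch of the reasoning above, not the paper's final equation:

```python
def total_cycles_mem_bound(n_warps, mwp, mem_period, comp_period):
    """Total cycles when CWP >= MWP.

    Memory waiting periods proceed in groups of MWP, so n_warps / mwp
    of them appear on the critical path; only the first MWP warps'
    computation periods are left exposed.
    """
    mem_part = (n_warps / mwp) * mem_period
    return mem_part + mwp * comp_period

# 8 warps, MWP = 2, memory period 300 cycles, computation period 100:
# (8/2)*300 + 2*100 = 1400 cycles; the remaining 6 computation periods
# (600 cycles) are completely hidden, matching the text.
```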
For Case 2 in Figure 5b, there are four warps.