CUDA Basics
Murphy Stein
New York University
Overview
● Device Architecture
● CUDA Programming Model
● Matrix Transpose in CUDA
● Further Reading
What is CUDA?
CUDA stands for:
"Compute Unified Device Architecture"
It is 2 things:
1. Device Architecture Specification
2. A small extension to C
= New Syntax + Builtin Variables – Restrictions + Libraries
Device Architecture: Streaming Multiprocessor (SM)
1 SM contains 8 scalar cores
● Up to 8 cores can run simultaneously
● Each core executes the identical instruction stream, or sleeps
● SM schedules instructions across cores with 0 overhead
● Up to 32 threads may be scheduled at a time, called a warp, but max 24 warps active in 1 SM
● Thread-level memory sharing supported via Shared Memory
● Register memory is local to a thread, and divided amongst all blocks on the SM
[Figure: SM block diagram — Instruction Fetch/Dispatch feeding Streaming Cores #1 through #8; Registers 8 KB; Shared Memory 16 KB; Texture Memory Cache 5-8 KB; Constant Memory Cache 8 KB]
Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
– A kernel scales across any number of parallel processors
[Figure: the same 8-block kernel grid runs over time on a 2-SM device (blocks scheduled 2 at a time) or a 4-SM device (4 at a time). Each block can execute in any order relative to other blocks.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible warps are selected for execution on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
– If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency
[Figure: SM multithreaded warp scheduler issuing over time — warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 11, then warp 8 instruction 12, warp 3 instruction 96, ...]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
Device Architecture
[Figure: one GPU — the Host feeds a Block Execution Manager, which dispatches work to N Streaming Multiprocessors (SM 1, 2, ..., N-1, N). All SMs share Global Memory (768 MB - 4 GB), Texture Memory, and Constant Memory (64 KB).]
C Extension
Consists of:
● New Syntax and Builtin Variables
● Restrictions to ANSI C
● API/Libraries
C Extension: New Syntax
New Syntax:
● <<< ... >>>
● __host__, __global__, __device__
● __constant__, __shared__, __device__
● __syncthreads()
C Extension: Builtin Variables
Builtin Variables:
• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
C Extension: Restrictions
New Restrictions:
● No recursion in device code
● No function pointers in device code
CUDA API
● CUDA Runtime (Host and Device)
● Device Memory Handling (cudaMalloc,...)
● Builtin Math Functions (sin, sqrt, fmod, ...)
● Atomic operations (for concurrency)
● Datatypes (2D textures, dim3, float4, ...)
Compiling a CUDA Program
[Figure: compilation flow — NVCC splits a C/C++ CUDA application; host C/C++ code goes through gcc to CPU instructions, while device code becomes virtual PTX code that a PTX-to-target compiler turns into GPU instructions.]
Matrix Transpose
[Figure: transposing M sends element M(i,j) to M(j,i); in block form, the transpose of [A B; C D] is [A^T C^T; B^T D^T].]
Matrix Transpose: First idea
● Each thread block transposes an equal-sized block of matrix M
● Assume M is square (n x n)
● What is a good block size?
● CUDA places limitations on the number of threads per block
● 512 threads per block is the maximum allowed by CUDA
[Figure: n x n matrix M divided into equal-sized square blocks]
Matrix Transpose: First idea
#include <stdlib.h>
#include <cuda_runtime.h>

__global__
void transpose(float* in, float* out, unsigned int width) {
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char** argv) {
    const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);

    /* One 16x16 thread block per tile of M */
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);

    float* M = (float*)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE);

    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);

    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}
Further Reading
● Online Course:
– UIUC NVIDIA Programming Course by David Kirk and Wen-mei W. Hwu
– http://courses.ece.illinois.edu/ece498/al/Syllabus.html
● CUDA@MIT '09
– http://sites.google.com/site/cudaiap2009/materials1/lectures
● Great Memory Latency Study:
– "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs" by Volkov & Demmel
● Book of advanced examples:
– "GPU Gems 3" edited by Hubert Nguyen
● CUDA SDK
– Tons of source code examples available for download from NVIDIA's website