CUDA Basics
Murphy Stein
New York University
Overview
● Device Architecture
● CUDA Programming Model
● Matrix Transpose in CUDA
● Further Reading
What is CUDA?
CUDA stands for:
"Compute Unified Device Architecture"
It is 2 things:
1. Device Architecture Specification
2. A small extension to C
= New Syntax + Builtin Variables – Restrictions + Libraries
Device Architecture: Streaming Multiprocessor (SM)
1 SM contains 8 scalar cores
● Up to 8 cores can run simultaneously
● Each core executes the identical instruction stream, or sleeps
● SM schedules instructions across cores with 0 overhead
● Up to 32 threads may be scheduled at a time, called a warp, but max 24 warps active in 1 SM
● Thread-level memory sharing supported via Shared Memory
● Register memory is local to a thread, and divided amongst all blocks on the SM
[Figure: SM block diagram — Instruction Fetch/Dispatch feeding Streaming Cores #1 through #8; Registers 8 KB; Shared Memory 16 KB; Texture Memory Cache 5-8 KB; Constant Memory Cache 8 KB]
Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
– A kernel scales across any number of parallel processors
[Figure: the same 8-block kernel grid runs over time on a 2-SM device (blocks scheduled 2 at a time) or a 4-SM device (4 at a time). Each block can execute in any order relative to other blocks.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible warps are selected for execution on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
– If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency
[Figure: SM multithreaded warp scheduler issuing over time — warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 11, then warp 8 instruction 12, warp 3 instruction 96, ...]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
Device Architecture
[Figure: one GPU — the Host feeds a Block Execution Manager, which dispatches work to N Streaming Multiprocessors (SM 1, 2, ..., N-1, N). All SMs share Global Memory (768 MB - 4 GB), Texture Memory, and Constant Memory (64 KB).]
C Extension
Consists of:
● New Syntax and Builtin Variables
● Restrictions to ANSI C
● API/Libraries
C Extension: New Syntax
New Syntax:
● <<< ... >>>
● __host__, __global__, __device__
● __constant__, __shared__, __device__
● __syncthreads()
C Extension: Builtin Variables
Builtin Variables:
• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
C Extension: Restrictions
New Restrictions:
● No recursion in device code
● No function pointers in device code
CUDA API
● CUDA Runtime (Host and Device)
● Device Memory Handling (cudaMalloc,...)
● Builtin Math Functions (sin, sqrt, fmod, ...)
● Atomic operations (for concurrency)
● Datatypes (2D textures, dim3, float4, ...)
Compiling a CUDA Program
[Figure: compilation flow — NVCC splits a C/C++ CUDA application; host C/C++ code goes through gcc to CPU instructions, while device code becomes virtual PTX code that a PTX-to-target compiler turns into GPU instructions.]
Matrix Transpose
[Figure: transposing M sends element M(i,j) to M(j,i); in block form, the transpose of [A B; C D] is [A^T C^T; B^T D^T].]
Matrix Transpose: First idea
● Each thread block transposes an equal-sized block of matrix M
● Assume M is square (n x n)
● What is a good block size?
● CUDA places limitations on the number of threads per block
● 512 threads per block is the maximum allowed by CUDA
[Figure: n x n matrix M divided into equal-sized square blocks]
Matrix Transpose: First idea
#include <stdlib.h>
#include <cuda_runtime.h>

__global__
void transpose(float* in, float* out, unsigned int width) {
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char** argv) {
    const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);

    /* One 16x16 thread block per tile of M */
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);

    float* M = (float*)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE);

    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);

    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}
Further Reading
● Online Course:
– UIUC NVIDIA Programming Course by David Kirk and Wen-mei W. Hwu
– http://courses.ece.illinois.edu/ece498/al/Syllabus.html
● CUDA@MIT '09
– http://sites.google.com/site/cudaiap2009/materials1/lectures
● Great Memory Latency Study:
– "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs" by Volkov & Demmel
● Book of advanced examples:
– "GPU Gems 3" edited by Hubert Nguyen
● CUDA SDK
– Tons of source code examples available for download from NVIDIA's website