Yangdong Deng: High-Performance Embedded Computing with GPUs (CUDA Tech Salon)
High Performance Embedded Computing with Massively Parallel Processors
Yangdong Steve Deng (邓仰东), dengyd@tsinghua.edu.cn, Tsinghua University

Outline
- Motivation and background
- Morphing the GPU into a network processor
- A high-performance radar DSP processor
- Conclusion

High Performance Embedded Computing
- Future IT infrastructure demands ever higher computing power:
  - Core Internet router throughput: up to 90 Tbps
  - 4G wireless base stations: 1 Gbit/s data rate per customer, with up to 200 subscribers in a service area
  - CMU driverless car: 270 GFLOPs (giga floating-point operations per second)
  - …

Fast Increasing IC Costs
- Fabrication cost: Moore's Second Law states that the cost of doubling circuit density increases in line with Moore's First Law
- Design cost: now $20-50M per product, expected to reach $75-120M at the 32nm node
- The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M

Implications of the Prohibitive Cost
- ASICs would be unaffordable for many applications!
- Scott MacGregor, CEO of Broadcom: "Broadcom is not intending a move to 45nm in the next year or so as it will be too expensive."
- David Turek, VP of IBM: "IBM will be pulling out of Cell development, with PowerXCell 8i to be the company's last entrance in the technology."

Multicore Machines Are Really Powerful!
- AMD 12-core CPU
- Tilera Tile Gx100 CPU
- NVIDIA Fermi GPU
- GPU: Graphics Processing Unit; GPGPU: General-Purpose GPU

Implications
- An increasing number of applications will be implemented on multi-core devices:
  - Huawei: multi-core base stations
  - Intel: cluster-based Internet routers
  - IBM: signal processing and radar applications on the Cell processor
  - …
- Multi-core also meets the strong demand for customizability and extendibility

Outline
- Motivation and background
- Morphing the GPU into a network processor
- A high-performance radar DSP processor
- Conclusion

Software Routing with GPU
- Background and motivation
- GPU-based routing processing: routing table lookup, packet classification, deep packet inspection
- GPU microarchitecture enhancement: CPU and GPU integration, QoS-aware scheduling

Ever-Increasing Internet Traffic

Fast Changing Network Protocols/Services
- New services are rapidly appearing: data centers, Ethernet forwarding, virtual LANs, …
- Personal customization is often essential for QoS
- Yet today's Internet heavily depends on just two protocols, Ethernet and IPv4, both developed in the 1970s!

Internet Router
- Backbone network device: packet forwarding and path finding; connects multiple subnets
- Key requirements: high throughput (40 Gbps to 90 Tbps) and high flexibility

Current Router Solutions
- Hardware routers: fast, but long design time, expensive, and hard to maintain
- Network-processor-based routers: a network processor is a data-parallel packet processor, but has no good programming model
- Software routers:
  extremely flexible and low cost, but slow

Outline
- Background and motivation
- GPU-based routing processing: routing table lookup, packet classification, deep packet inspection
- GPU microarchitecture enhancement: CPU and GPU integration, QoS-aware scheduling

Critical Path of Routing Processing

GPU Based Software Router
- Data-level parallelism = packet-level parallelism

Routing Table Lookup
- The routing table contains network topology information
- Find the output port according to the destination IP address
- Routing tables are potentially large (~1M entries) and can be updated dynamically
- (Figure: an exemplar routing table)

Routing Table Lookup
- Longest prefix match; memory bound
- Usually based on a trie data structure: a prefix tree with strings as keys, where a node's position directly reflects its key
- Pointer operations cause widely divergent branches!

GPU Based Routing Table Lookup
- Organize the search trie into an array; pointers become offsets relative to the array head
- 6X speedup, even with frequent routing table updates

Packet Classification
- Match header fields against predefined rules
- Rule sets can be huge (e.g., over 5,000 rules)

Packet Classification
- Hardware solution: usually Ternary CAM (TCAM); expensive and power hungry
- Software solutions: linear search, hash based, or tuple space search (convert the rules into a set of exact matches)

GPU Based Packet Classification
- A linear search approach; scales to rule sets with 20,000 rules
- Meta-programming: compile rules into CUDA code with PyCUDA
- Example: treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority:
  if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77)) priority = 0;

GPU Based Packet Classification
- ~60X speedup

Deep Packet Inspection (DPI)
- Core component of network intrusion detection
- Against viruses, spam,
  software vulnerabilities, …
- Snort data flow: packet stream → packet decoder → preprocessor (plug-ins) → detection engine (plug-ins) → output stage (plug-ins) → alerts/logs
- Two matching styles: fixed string matching and regular expression matching
- Example rule: alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags: A+; content: "|0d0a5b52504c5d3030320d0a|";)

GPU Based Deep Packet Inspection (DPI)
- Fixed string matching: each rule is simply a disallowed string
- Bloom-filter based search: one warp per packet, one thread per string
- Throughput: 19.2 Gbps (30X speedup over Snort)
- (Figures: initial Bloom filter; Bloom vector after pre-processing the rules; checking packet content)

GPU Based Deep Packet Inspection (DPI)
- Regular expression matching: each rule is a regular expression, e.g., a|b* = {ε, a, b, bb, bbb, …}
- Aho-Corasick algorithm: converts the patterns into a finite state machine; matching is done by state traversal
- Memory bound, with virtually no computation
- Compress the state table by merging don't-care entries
- Throughput: 9.3 Gbps (15X speedup over Snort)
- Example: P = {he, she, his, hers}

Outline
- Background and motivation
- GPU-based routing processing: routing table lookup, packet classification, deep packet inspection
- GPU microarchitecture enhancement: CPU and GPU integration, QoS-aware scheduling

Limitation of GPU-Based Packet Processing
- (Figure: packet queue)

Microarchitectural Enhancements
- CPU-GPU integration with a shared memory, while maintaining the current CUDA interface
- Implemented on GPGPU-Sim*
  *A. Bakhoda, et al., "Analyzing CUDA Workloads Using a Detailed GPU Simulator," ISPASS, 2009.
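The trie-flattening idea behind the GPU routing-table lookup above can be sketched in plain Python. This is an illustrative host-side model, not the talk's CUDA implementation: each trie node becomes a `[left_offset, right_offset, port]` triple in a flat array, so traversal needs only integer offsets (GPU-friendly) instead of pointers. The toy routing table and all names are assumptions.

```python
# Illustrative sketch: flatten a binary IP-prefix trie into an offset array
# for longest-prefix match, as described in the slides. Offsets of -1 mean
# "no child"; a port of -1 means "no prefix ends at this node".

def build_trie(prefixes):
    """prefixes: dict mapping a bit-string prefix -> output port."""
    root = {"port": None, "0": None, "1": None}
    for bits, port in prefixes.items():
        node = root
        for b in bits:
            if node[b] is None:
                node[b] = {"port": None, "0": None, "1": None}
            node = node[b]
        node["port"] = port
    return root

def flatten(root):
    """Convert the pointer-based trie into a flat list of
    [left_offset, right_offset, port] triples."""
    nodes = []
    def visit(n):
        idx = len(nodes)
        nodes.append([-1, -1, n["port"] if n["port"] is not None else -1])
        for slot, b in ((0, "0"), (1, "1")):
            if n[b] is not None:
                nodes[idx][slot] = visit(n[b])
        return idx
    visit(root)
    return nodes

def lookup(table, addr_bits):
    """Longest-prefix match: walk offsets, remembering the last port seen."""
    idx, best = 0, -1
    for b in addr_bits:
        if table[idx][2] != -1:
            best = table[idx][2]
        idx = table[idx][0] if b == "0" else table[idx][1]
        if idx == -1:
            break
    else:
        if table[idx][2] != -1:
            best = table[idx][2]
    return best

# Toy table: default "" -> port 0, prefix "10" -> port 1, "101" -> port 2
flat = flatten(build_trie({"": 0, "10": 1, "101": 2}))
print(lookup(flat, "1011"))  # longest match "101" -> 2
print(lookup(flat, "1000"))  # longest match "10"  -> 1
print(lookup(flat, "0110"))  # only default ""     -> 0
```

On a GPU, each thread would run `lookup` for one packet over the same read-only array, which is why the slides report speedups even under frequent table updates: rebuilding an array is cheap and update-friendly compared to chasing pointers.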
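The meta-programming approach to packet classification above (rules compiled into CUDA via PyCUDA) can be mimicked in pure Python: each rule is turned into a generated `if` statement, exactly like the slides' `if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77)) priority = 0;` example. The rule format, function names, and the use of `exec` are assumptions for illustration only.

```python
# Illustrative sketch of rule-to-code compilation: classification rules are
# emitted as straight-line comparisons (here Python source; the talk emits
# CUDA source via PyCUDA) and then compiled into a callable classifier.
import ipaddress

def compile_rules(rules):
    """rules: list of (lo_ip, hi_ip, priority), highest priority first.
    Returns classify(DA), where DA is the destination address as an int."""
    lines = ["def classify(DA):"]
    for lo, hi, prio in rules:
        lo_i = int(ipaddress.IPv4Address(lo))
        hi_i = int(ipaddress.IPv4Address(hi))
        lines.append(f"    if {lo_i} <= DA <= {hi_i}: return {prio}")
    lines.append("    return 255  # default: lowest priority")
    ns = {}
    exec("\n".join(lines), ns)   # compile the generated source
    return ns["classify"]

# The rule from the slides: 166.111.66.70 - 166.111.66.77 -> priority 0
classify = compile_rules([("166.111.66.70", "166.111.66.77", 0)])
print(classify(int(ipaddress.IPv4Address("166.111.66.72"))))  # -> 0
print(classify(int(ipaddress.IPv4Address("8.8.8.8"))))        # -> 255
```

The design point is that the rule set is fixed at compile time, so the generated code contains only constants and comparisons: no rule table to fetch from memory at match time, which suits a linear-search GPU kernel.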
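The Aho-Corasick matching described above, using the slides' own example set P = {he, she, his, hers}, can be sketched as follows. This is a minimal CPU reference (no state-table compression, dictionaries instead of a dense table); it shows why matching is pure state traversal with virtually no computation, which is what makes the GPU version memory bound.

```python
# Illustrative Aho-Corasick sketch: patterns are compiled into a trie with
# failure links (a finite state machine); matching is a single pass of state
# transitions over the packet payload.
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                       # build the trie (goto function)
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    q = deque(goto[0].values())              # BFS to compute failure links
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]           # inherit matches from suffix state
    return goto, fail, out

def match(text, goto, fail, out):
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:        # follow failure links
            s = fail[s]
        s = goto[s].get(c, 0)                # take the goto transition
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)

# The slides' example pattern set
g, f, o = build_automaton(["he", "she", "his", "hers"])
print(match("ushers", g, f, o))  # -> [(1, 'she'), (2, 'he'), (2, 'hers')]
```

All patterns are matched in one traversal of the input, so throughput is limited by how fast states can be fetched; that is why the slides compress the state table by merging don't-care entries.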
  (Figure: CPU/GPU shared memory architecture)

Microarchitectural Enhancements
- Uniformly one thread per packet: no thread blocks necessary; warps are scheduled and issued directly
- The GPU fetches packet IDs from the task queue when either a sufficient number of packets have been collected, or a given interval has passed since the last fetch

Results: Throughput
Results: Packet Latency

Outline
- Motivation and background
- Morphing the GPU into a network processor
- A high-performance radar DSP processor
- Conclusion

High Performance Radar DSP Processor
- Motivation
- Feasibility of the GPU for DSP processing
- Designing a massively parallel DSP processor

Research Objectives
- A high-performance DSP processor for demanding applications: radar, sonar, cellular baseband, …
- Performance requirements: throughput ≥ 800 GFLOPS; power efficiency ≥ 100 GFLOPS/W; memory bandwidth ≥ 400 Gbit/s; scalable to multi-chip solutions

Current DSP Platforms
- *GDDR5: peak bandwidth 28.2 GB/s

High Performance Radar DSP Processor
- Motivation
- Feasibility of the GPU for DSP processing
- Designing a massively parallel DSP processor

HPEC Challenge - Radar Benchmarks
GPU Implementation

Performance Results
- *The throughputs of CT and DB are measured in Mbytes/s and transactions/s, respectively.

Performance Comparison
- GPU: NVIDIA Fermi; CPU: Intel Core 2 Duo (3.33 GHz); DSP: AD TigerSHARC 101

Instruction Profiling

Thread Profiling
- Warp occupancy: the number of active threads in an issued warp (32 threads per warp)

Off-Chip Memory Profiling
- DRAM efficiency: the percentage of time spent sending data across the DRAM pins, relative to the whole time of memory service

Limitation
- The GPU suffers from low power efficiency (MFLOPS/W)

High Performance Radar DSP Processor
- Motivation
- Feasibility of the GPU for DSP processing
- Designing a massively parallel DSP processor

Key Idea - Hardware Architecture
- Borrow the GPU microarchitecture, using a DSP core as the basic execution unit
- Multiprocessors organized into programmable pipelines; neighboring multiprocessors can be merged into wider datapaths

Key Idea - Parallel Code Generation
- Meta-programming based parallel code generation
- Foundation technologies:
  - GPU meta-programming frameworks: Copperhead (UC Berkeley) and PyCUDA (New York University)
  - DSP code generation framework: Spiral (Carnegie Mellon University)

Key Idea - Internal Representation as KPN
- Kahn Process Network (KPN): a generic model for concurrent computation
- Solid theoretical foundation: process algebra

Scheduling and Optimization on KPN
- Automatic task and thread scheduling and mapping
- Extract data parallelism through process splitting
- Latency- and throughput-aware scheduling
- Performance estimation based on analytical models

Key Idea - Low Power Techniques
- GPU-like processors are power hungry! Potential low-power techniques:
  - Aggressive memory coalescing
  - Enable task pipelines to avoid synchronization via global memory
  - Operation chaining to avoid extra memory accesses
  - ???

Outline
- Motivation and background
- Morphing the GPU into a network processor
- A high-performance radar DSP processor
- Conclusion

Conclusion
- A new market of high-performance embedded computing is emerging
- Multi-core engines will be the workhorses, requiring both HW and SW research
- Case study 1: GPU-based Internet routing
- Case study 2: a massively parallel DSP processor
- Significant performance improvements, but more work ahead: low power, scheduling, parallel programming models, legacy code, …