The BlackWidow High-Radix Clos Network
Steve Scott∗ Dennis Abts∗ John Kim† William J. Dally†
sscott@cray.com dabts@cray.com jjk12@stanford.edu dally@stanford.edu
∗Cray Inc., Chippewa Falls, Wisconsin 54729
†Stanford University, Computer Systems Laboratory, Stanford, California 94305
Abstract
This paper describes the radix-64 folded-Clos network of the
Cray BlackWidow scalable vector multiprocessor. We describe the
BlackWidow network which scales to 32K processors with a worst-
case diameter of seven hops, and the underlying high-radix router
microarchitecture and its implementation. By using a high-radix
router with many narrow channels we are able to take advantage
of the higher pin density and faster signaling rates available in
modern ASIC technology. The BlackWidow router is an 800 MHz
ASIC with 64 18.75Gb/s bidirectional ports for an aggregate off-
chip bandwidth of 2.4Tb/s. Each port consists of three 6.25Gb/s
differential signals in each direction. The router supports deter-
ministic and adaptive packet routing with separate buffering for
request and reply virtual channels. The router is organized hier-
archically [13] as an 8×8 array of tiles, which simplifies arbitra-
tion by avoiding long wires in the arbiters. Each tile of the array
contains a router port, its associated buffering, and an 8×8 router
subswitch. The router ASIC is implemented in a 90nm CMOS stan-
dard cell ASIC technology and went from concept to tapeout in 17
months.
1 Introduction
The interconnection network plays a critical role in the cost
and performance of a scalable multiprocessor. It determines the
point-to-point and global bandwidth of the system, as well as the
latency for remote communication. Latency is particularly impor-
tant for shared-memory multiprocessors, in which memory access
and synchronization latencies can significantly impact application
scalability, and is becoming a greater concern as system sizes grow
and clock cycles shrink.
The Cray BlackWidow system is designed to run demanding
applications with high communication requirements. It provides a
globally shared memory with direct load/store access, and, unlike
conventional microprocessors, each processor in the BlackWidow
system can support thousands of outstanding global memory refer-
ences. The network must therefore provide very high global band-
width, while also providing low latency for efficient synchroniza-
tion and scalability.
Over the past 15 years the vast majority of interconnection net-
works have used low-radix topologies. Many multiprocessors have
used a low-radix k-ary n-cube or torus topology [6] including the
SGI Origin2000 hypercube [14], the dual-bristled, sliced 2-D torus
of the Cray X1 [3], the 3-D torus of the Cray T3E [20] and Cray
XT3 [5], and the torus of the Alpha 21364 [18]. The Quadrics
switch [1] uses a radix-8 router, the Mellanox router [17] is radix-
24, and the highest radix available from Myrinet is radix-32 [19].
The IBM SP2 switch [22] is radix-8.
The BlackWidow network uses a high-radix folded Clos [2] or
fat-tree [15] topology with sidelinks. A low-radix fat-tree topol-
ogy was used in the CM-5 [16], and this topology is also used
in many clusters, including the Cray XD1 [4]. The BlackWidow
topology extends this previous work by using a high-radix and
adding sidelinks to the topology.
During the past 15 years, the total bandwidth per router has
increased by nearly three orders of magnitude, due to a combina-
tion of higher pin density and faster signaling rates, while typi-
cal packet sizes have remained roughly constant. This increase in
router bandwidth relative to packet size motivates networks built
from many thin links rather than fewer fat links as in the recent
past[13]. Building a network using high-radix routers with many
narrow ports reduces the latency and cost of the resulting network.
The design of the YARC1 router and BlackWidow network
make several contributions to the field of interconnection network
design:
• The BlackWidow topology extends the folded-Clos topology
to include sidelinks, which allow the global network band-
width to be statically partitioned among the peer subtrees,
reducing the cost and the latency of the network.
• The YARC microarchitecture is adapted to the constraints
imposed by modern ASIC technology — abundant wiring
but limited buffers. The abundant wiring available in the
ASIC process enabled an 8× speedup in the column orga-
nization of the YARC switch, greatly simplifying global ar-
bitration. At the same time, wormhole flow control was used
internal to YARC because insufficient buffers were available
to support virtual cut-through flow control.
• YARC provides fault tolerance by using a unique routing ta-
ble structure to configure the network to avoid bad links and
nodes. YARC also provides link-level retry and automatically
reconfigures to reduce channel width to avoid a faulty bit or
bits.
1YARC stands for ’Yet Another Router Chip’, and is also Cray spelled
backwards.
• YARC employs both an adaptive routing method and a deter-
ministic routing method based on address hashing to balance
load across network channels and routers.
This paper describes the BlackWidow (BW) multiprocessor
network and the microarchitecture of YARC, the high-radix router
chip used in the BW network. The rest of the paper is organized
as follows. An overview of the BlackWidow network and the
high-radix Clos topology is described in Section 2. We provide
an overview of the microarchitecture of the YARC router used in
the BlackWidow network in Section 3. In Section 4, the commu-
nication stack of the router is discussed and the routing within the
BlackWidow network is described in Section 5. The implemen-
tation of the YARC router is described in Section 6. We provide
some discussions in Section 7 on key design points of the Black-
Widow network and the YARC router, and present conclusions in
Section 8.
2 The BlackWidow Network
2.1 System Overview
The Cray BlackWidow multiprocessor is the follow-on to the
Cray X1. It is a distributed shared memory multiprocessor built
with high performance, high bandwidth custom processors. The
processors support latency hiding, addressing and synchronization
features that facilitate scaling to large system sizes. Each Black-
Widow processor is implemented on a single chip and includes a
4-way-dispatch scalar core, 8 vector pipes, two levels of cache and
a set of ports to the local memory system.
The system provides a shared memory with global load/store
access. It is globally cache coherent, but each processor only
caches data from memory within its four-processor node. This
provides natural support for SMP applications on a single node,
and hierarchical (e.g., shmem or MPI on top of OpenMP) applica-
tions across the entire machine. Pure distributed memory applica-
tions (MPI, shmem, CAF, UPC) are of course also supported, and
expected to represent the bulk of the workload.
2.2 Topology and Packaging
The BlackWidow network uses YARC high-radix routers, each
of which has 64 ports that are three bits wide in each direction.
Each BW processor has four injection ports into the network (Fig-
ure 1), with each port connecting to a different network slice. Each
slice is a completely separate network with its own set of YARC
router chips. The discussion of the topology in this section focuses
on a single slice of the network.
The BlackWidow network scales up to 32K processors using
a variation on a folded-Clos or fat-tree network topology that can
be incrementally scaled. The BW system is packaged in mod-
ules, chassis, and cabinets. Each compute module contains eight
processors with four network ports each. A chassis holds eight
compute modules organized as two 32-processor rank 1 (R1) sub-
trees, and up to four R1 router modules (each of which provides
two network slices for one of the subtrees). Each R1 router module
Figure 1. The BlackWidow network building blocks are
32-processor local groups connected via two rank 1
router modules each with two YARC (Y) router chips.
contains two 64-port YARC router chips (Figure 1) providing 64
downlinks that are routed to the processor ports via a mid-plane,
and 64 uplinks (or sidelinks) that are routed to eight 96-pin cable
connectors that carry eight links each.2 Each cabinet holds two
chassis (128 processors) organized as four 32-processor R1 sub-
trees. Machines with up to 288 processors (nine R1 subtrees) can
be connected by directly cabling the R1 subtrees to one another us-
ing sidelinks as shown in Figures 2(a) and 2(b) to create a rank 1.5 (R1.5)
network.
To scale beyond 288 processors, the uplink cables from each
R1 subtree are connected to rank 2 (R2) routers. A rank 2/3 router
module (Figure 2c) packages four YARC router chips on an R2/R3
module. The four radix-64 YARC chips on the R2/R3 module are
each split into two radix-32 virtual routers (see Section 7.4). Log-
ically, each R2/R3 module has eight radix-32 routers providing
256 network links on 32 cable connectors. Up to 16 R2/R3 router
modules are packaged into a stand-alone router cabinet.
Machines of up to 1024 processors can be constructed by con-
necting up to 32 32-processor R1 subtrees to R2 routers. Machines
of up to 4.5K processors can be constructed by connecting up to
9 512-processor R2 subtrees via side links. Up to 16K proces-
sors may be connected by a rank 3 (R3) network where up to 32
512-processor R2 subtrees are connected by R3 routers. In theory,
networks of up to 72K processors could be constructed by connect-
ing nine R3 subtrees via side links; however, the maximum-size
BW system is 32K processors.
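The scaling limits above follow a simple pattern, sketched below in Python. The assumption (not stated explicitly in the text) is that a non-top-rank radix-32 virtual router splits its ports evenly, 16 downlinks and 16 uplinks, which is consistent with the 512-processor R2 subtree and all four machine sizes quoted above.

```python
# Sketch of the BlackWidow scaling limits. Assumption: a radix-32 virtual
# router that still has uplinks splits its ports 16 down / 16 up; a top-rank
# router spends all 32 ports on downlinks. Rank-1 subtrees hold 32 processors.
R1_SUBTREE = 32

def subtree_size(rank):
    """Processors under a rank-`rank` subtree that retains uplinks."""
    return R1_SUBTREE * 16 ** (rank - 1)

def max_machine(top_rank, sidelinks=False):
    """Largest machine topped by rank-`top_rank` routers (or their sidelinks)."""
    if sidelinks:
        # up to nine peer subtrees directly cabled to one another
        return 9 * subtree_size(top_rank)
    # top routers use all 32 ports as downlinks to rank (top_rank - 1) subtrees
    return 32 * subtree_size(top_rank - 1)

print(max_machine(2))                  # 1024
print(max_machine(2, sidelinks=True))  # 4608  (the ~4.5K figure)
print(max_machine(3))                  # 16384 (16K)
print(max_machine(3, sidelinks=True))  # 73728 (the ~72K figure)
```

With `top_rank=1` and sidelinks, the same rule reproduces the 288-processor R1.5 configuration.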
The BW topology and packaging scheme enables very flexible
provisioning of network bandwidth. For instance, by only using
2Each network cable carries eight links to save cost and mitigate cable
bulk.
Figure 2. The BlackWidow network scales up to 32K pro-
cessors. Each rank 1 (R1) router module connects 32
BW processors and the rank 2/3 (R2/R3) modules con-
nect multiple R1 subtrees.
Figure 3. YARC router microarchitectural block diagram.
YARC is divided into an 8×8 array of tiles, where each tile
contains an input queue, row buffers, column buffers,
and an 8×8 subswitch.
a single rank 1 router module (instead of two as shown in Fig-
ure 1), the port bandwidth of each processor is cut in half —
halving both the cost of the network and its global bandwidth. An
additional bandwidth taper can be achieved by connecting only a
subset of the rank 1 to rank 2 network cables, reducing cabling
cost and R2 router cost at the expense of the bandwidth taper.
3 YARC Microarchitecture
The input-queued crossbar organization often used in low-radix
routers does not scale efficiently to high radices because the arbi-
tration logic and wiring complexity both grow quadratically with
the number of inputs. To overcome this complexity, we use a hier-
archical organization similar to that proposed by [13]. As shown
in Figure 3, YARC is organized as an 8×8 array of tiles. Each
tile contains all of the logic and buffering associated with one
input port and one output port. Each tile also contains an 8×8
switch and associated buffers. Each tile’s switch accepts inputs
from eight row buses that are driven by the input ports in its row,
and drives separate output channels to the eight output ports in
its column. Using a tile-based microarchitecture facilitates imple-
mentation, since each tile is identical and produces a very regular
structure for replication and physical implementation in silicon.
The YARC microarchitecture is best understood by following a
packet through the router. A packet arrives in the input buffer of a
tile. When the packet reaches the head of the buffer a routing deci-
sion is made to select the output column for the packet. The packet
is then driven onto the row bus associated with the input port and
buffered in a row buffer at the input of the 8×8 switch at the junc-
Figure 4. YARC pipeline diagram shows the tile divided into three blocks: input queue, subswitch, and column buffers.
tion of the packet’s input row and output column. At this point the
routing decision must be refined to select a particular output port
within the output column. The switch then routes the packet to
the column channel associated with the selected output port. The
column channel delivers the packet to an output buffer (associated
with the input row) at the output port multiplexer. Packets in the
per-input-row output buffers arbitrate for access to the output port
and, when granted access, are switched onto the output port via
the multiplexer.
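The traversal just described can be summarized as a sequence of stops keyed by tile coordinates. The sketch below assumes the 64 ports are numbered 0–63 in row-major order over the 8×8 tile array; the paper does not specify the numbering.

```python
# Sketch of a packet's path through YARC's tiled crossbar. Assumption: port p
# lives in tile row p // 8 (row-major numbering, not given in the paper).
TILE_DIM = 8

def path(in_port, out_port):
    """List the buffering/switching stops between an input and output port."""
    row = in_port // TILE_DIM    # row bus driven by the input tile
    col = out_port // TILE_DIM   # column chosen by the first routing decision
    return [
        ("input buffer", in_port),
        ("row bus", row),
        ("row buffer @ subswitch", (row, col)),  # 8x8 subswitch at (row, col)
        ("column channel", out_port),            # dedicated wire per output
        ("column buffer @ output tile", (out_port, row)),
        ("output mux", out_port),
    ]

for stage, where in path(in_port=10, out_port=53):
    print(f"{stage}: {where}")
```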
There are three sets of buffers in YARC: input buffers, row
buffers, and column buffers. Each buffer is partitioned into two
virtual channels. One input buffer and 8 row buffers are associ-
ated with each input port. Thus, no arbitration is needed to allocate
these buffers — only flow control. Eight column buffers are asso-
ciated with each subswitch. Allocation of these column buffers
takes place at the same time the packet is switched.
Like the design of [13], output arbitration is performed in two
stages. The first stage of arbitration is done to gain access to the
output of the subswitch. A packet then competes with packets
from other tiles in the same column in the second stage of arbitra-
tion for access to the output port. Unlike the hierarchical cross-
bar in [13], the YARC router takes advantage of the abundant on-
chip wiring resources to run separate channels from each output of
each subswitch to the corresponding output port. This organiza-
tion places the column buffers in the output tiles rather than at the
output of the subswitches. Co-locating the eight column buffers
associated with a given output in a single tile simplifies global
output arbitration. With column buffers at the outputs of the sub-
switch, the requests/grants to/from the global arbiters would need
to be pipelined to account for wire delay which would complicate
the arbitration logic.
As shown in Figure 4, a packet traversing the YARC router
passes through 25 pipeline stages, which results in a zero-load la-
tency of 31.25ns. To simplify implementation of YARC, each ma-
jor block (input queue, subswitch, and column buffers) was de-
signed with both input and output registers. This approach simpli-
fied system timing at the expense of latency. During the design,
additional pipeline stages were inserted to pipeline the wire delay
associated with the row busses and the column channels.
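At the 800 MHz core clock, the quoted zero-load latency is simply the stage count times the 1.25 ns cycle time:

```python
# Zero-load latency check: 25 register-to-register stages at 800 MHz.
CLOCK_HZ = 800e6
PERIOD_NS = 1e9 / CLOCK_HZ   # 1.25 ns per stage
STAGES = 25
print(STAGES * PERIOD_NS)    # 31.25
```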
4 Communication Stack
This section describes the three layers of the communication
stack: network layer, data-link layer, and physical layer. We dis-
cuss the packet format, flow control across the network links, the
link control block (LCB) which implements the data-link layer,
and the serializer/deserializer (SerDes) at the physical layer.
4.1 Packet Format
The format of a packet within the BlackWidow network is
shown in Figure 5. Packets are divided into 24-bit phits for trans-
mission over internal YARC datapaths. These phits are further
serialized for transmission over 3-bit wide network channels. A
minimum packet contains 4 phits carrying 32 payload bits. Longer
packets are constructed by inserting additional payload phits (like
the third phit in the figure) before the tail phit. Two bits of each
phit, as well as all of the tail phit, are used by the data-link layer.
The head phit of the packet controls routing which will be de-
scribed in detail in Section 5. In addition to specifying the destina-
tion, this phit contains a v bit that specifies which virtual channel
to use, and three bits, h, a, and r, that control routing. If the r
bit is set, the packet will employ source routing. In this case, the
packet header will be accompanied by a routing vector that indi-
cates the path through the network as a list of ports to select the
output port at each hop. Source routed packets are used only for
maintenance operations such as reading and writing configuration
registers on the YARC. If the a bit is set, the packet will route
adaptively, otherwise it will route deterministically. If the h bit is
set, the deterministic routing algorithm employs the hash bits in
the second phit to select the output port.
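The interaction of the r, a, and h bits can be summarized as a small decision procedure. The precedence of r over a is an assumption; the paper describes each bit individually rather than as an ordered check.

```python
# Decision procedure implied by the head-phit control bits described above.
# Assumption: r takes precedence over a (the paper does not state an order).
def routing_mode(r, a, h):
    if r:
        return "source-routed"         # maintenance traffic only
    if a:
        return "adaptive"
    if h:
        return "deterministic-hashed"  # hash bits in the 2nd phit pick the port
    return "deterministic"

assert routing_mode(r=1, a=0, h=0) == "source-routed"
assert routing_mode(r=0, a=1, h=0) == "adaptive"
assert routing_mode(r=0, a=0, h=1) == "deterministic-hashed"
```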
4.2 Network Layer Flow Control
The allocation unit for flow control is a 24-bit phit — thus,
the phit is really the flit (flow control unit). The BlackWidow net-
work uses two virtual channels (VCs) [7], designated request (v=0)
and response (v=1) to avoid request-response deadlocks in the net-
work. Therefore, all buffer resources are allocated according to
the virtual channel bit in the head phit. Each input buffer is 256
phits and is sized to cover the round-trip latency across the net-
work channel. Virtual cut-through flow control [12] is used across
the network links.
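Per-VC credit-based allocation against the input buffer can be sketched as follows. The even 128-phit split between the two VCs is an assumption; the text gives only the 256-phit total per input buffer.

```python
# Credit-counter sketch for the per-VC flow control described above.
# Assumption: the 256-phit input buffer is split evenly between the two VCs.
class CreditedLink:
    def __init__(self, credits_per_vc=128):
        # vc 0 = request, vc 1 = response
        self.credits = {0: credits_per_vc, 1: credits_per_vc}

    def can_send(self, vc, phits):
        # virtual cut-through: the whole packet must fit downstream
        return self.credits[vc] >= phits

    def send(self, vc, phits):
        assert self.can_send(vc, phits), "would overflow the input buffer"
        self.credits[vc] -= phits

    def ack(self, vc, phits):
        # credits returned as the receiver drains phits
        self.credits[vc] += phits

link = CreditedLink()
link.send(0, 19)          # a 19-phit packet on the request VC
print(link.credits[0])    # 109
```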
Figure 5. Packet format of the BlackWidow network.
4.3 Data-link Layer Protocol
The YARC data-link layer protocol is implemented by the link
control block (LCB). The LCB receives phits from the router core
and injects them into the serializer logic where they are transmitted
over the physical medium. The primary function of the LCB is to
reliably transmit packets over the network links using a sliding
window go-back-N protocol. The send buffer storage and retry is
on a packet granularity.
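A go-back-N sender at packet granularity, as described above, can be sketched as follows; the window size, API, and cumulative-ack convention are illustrative rather than taken from the LCB design.

```python
import collections

# Go-back-N sender sketch at packet granularity, per the LCB description
# above. Window size and cumulative-ack convention are assumptions.
class GoBackNSender:
    def __init__(self, window=8):
        self.window = window
        self.unacked = collections.deque()  # (seq, packet) sent but not acked
        self.next_seq = 0

    def send(self, packet, tx):
        if len(self.unacked) >= self.window:
            return False                    # window full: stall the sender
        tx(self.next_seq, packet)
        self.unacked.append((self.next_seq, packet))
        self.next_seq += 1
        return True

    def on_ack(self, seq):                  # cumulative ack through `seq`
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()

    def on_error(self, tx):                 # go back N: resend everything unacked
        for seq, packet in self.unacked:
            tx(seq, packet)
```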
Each 24-bit phit dedicates two bits of sideband as a control
channel for the LCB to carry sequence numbers and status infor-
mation. The virtual channel acknowledgment status bits travel in
the LCB sideband. These VC acks are used to increment the per-
vc credit counters in the output port logic. The ok field in the EOP
phit indicates if the packet is healthy, encountered a transmission
error on the current link (transmit error), or was corrupted prior
to transmission (soft error). The YARC internal datapath uses the
CRC to detect soft errors in the pipeline data paths and static mem-
ories used for storage. Before transmitting a tail phit onto the net-
work link, the LCB will check the current CRC against the packet
contents to determine if a soft error has corrupted the packet. If
the packet is corrupted, it is marked as soft error, and a good CRC
is generated so that it is not detected by the receiver as a transmis-
sion error. The packet will continue to flow through the network
marked as a bad packet with a soft error and eventually be dis-
carded by the network interface at the destination processor.
The narrow links of a high-radix router increase the serializa-
tion latency required to squeeze a packet over a link. For example, a 32B
cache-line write results in a packet with 19 phits (6 header, 12 data,
and 1 EOP). Consequently, the LCB passes phits up to the higher-
level logic speculatively, prior to verifying the packet CRC, which
avoids store-and-forward serialization latency at each hop. How-
ever, this early forwarding complicates error handling: the router
must correctly handle a packet with a transmission error and
reclaim the space it occupies in the input queue at the receiver.
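A back-of-envelope check of the serialization cost for the 32B write example, using the per-port rate of 3 × 6.25 Gb/s:

```python
# Serialization cost for the 32B cache-line write example above:
# 19 phits of 24 bits each over a port running 3 lanes x 6.25 Gb/s.
PHIT_BITS = 24
LINK_GBPS = 3 * 6.25          # 18.75 Gb/s per direction
phits = 6 + 12 + 1            # header + data + EOP
wire_ns = phits * PHIT_BITS / LINK_GBPS
print(wire_ns)                # ~24.3 ns of wire time per hop
```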
Because a packet with a transmission error is speculatively
passed up to the router core and may have already flowed to the
next router by the time the tail phit is processed, the LCB and input
queue must prevent corrupting the router state. The LCB detects
packet CRC errors and marks the packet as transmit error with a
corrected CRC before handing the end-of-packet (EOP) phit up to
the router core. The LCB also monitors the packet length of the re-
ceived data stream and clips any packets that exceed the maximum
packet length, which is programmed into an LCB configuration
register. When a packet is clipped, an EOP phit is appended to
the truncated packet and it is marked as transmit error. On ei-
ther error, the LCB will enter error recovery mode and await the
retransmission.
The input queue in the router must be protected from overflow. If it
receives more phits than can be