The BlackWidow High-Radix Clos Network
Steve Scott∗ Dennis Abts∗ John Kim† William J. Dally†
sscott@cray.com dabts@cray.com jjk12@stanford.edu dally@stanford.edu
∗Cray Inc., Chippewa Falls, Wisconsin 54729
†Stanford University, Computer Systems Laboratory, Stanford, California 94305
Abstract
This paper describes the radix-64 folded-Clos network of the
Cray BlackWidow scalable vector multiprocessor. We describe the
BlackWidow network which scales to 32K processors with a worst-
case diameter of seven hops, and the underlying high-radix router
microarchitecture and its implementation. By using a high-radix
router with many narrow channels we are able to take advantage
of the higher pin density and faster signaling rates available in
modern ASIC technology. The BlackWidow router is an 800 MHz
ASIC with 64 18.75Gb/s bidirectional ports for an aggregate off-
chip bandwidth of 2.4Tb/s. Each port consists of three 6.25Gb/s
differential signals in each direction. The router supports deter-
ministic and adaptive packet routing with separate buffering for
request and reply virtual channels. The router is organized hier-
archically [13] as an 8×8 array of tiles, which simplifies arbitra-
tion by avoiding long wires in the arbiters. Each tile of the array
contains a router port, its associated buffering, and an 8×8 router
subswitch. The router ASIC is implemented in a 90nm CMOS stan-
dard cell ASIC technology and went from concept to tapeout in 17
months.
1 Introduction
The interconnection network plays a critical role in the cost
and performance of a scalable multiprocessor. It determines the
point-to-point and global bandwidth of the system, as well as the
latency for remote communication. Latency is particularly impor-
tant for shared-memory multiprocessors, in which memory access
and synchronization latencies can significantly impact application
scalability, and is becoming a greater concern as system sizes grow
and clock cycles shrink.
The Cray BlackWidow system is designed to run demanding
applications with high communication requirements. It provides a
globally shared memory with direct load/store access, and, unlike
conventional microprocessors, each processor in the BlackWidow
system can support thousands of outstanding global memory refer-
ences. The network must therefore provide very high global band-
width, while also providing low latency for efficient synchroniza-
tion and scalability.
Over the past 15 years the vast majority of interconnection net-
works have used low-radix topologies. Many multiprocessors have
used a low-radix k-ary n-cube or torus topology [6] including the
SGI Origin2000 hypercube [14], the dual-bristled, sliced 2-D torus
of the Cray X1 [3], the 3-D torus of the Cray T3E [20] and Cray
XT3 [5], and the torus of the Alpha 21364 [18]. The Quadrics
switch [1] uses a radix-8 router, the Mellanox router [17] is radix-
24, and the highest radix available from Myrinet is radix-32 [19].
The IBM SP2 switch [22] is radix-8.
The BlackWidow network uses a high-radix folded Clos [2] or
fat-tree [15] topology with sidelinks. A low-radix fat-tree topol-
ogy was used in the CM-5 [16], and this topology is also used
in many clusters, including the Cray XD1 [4]. The BlackWidow
topology extends this previous work by using a high-radix and
adding sidelinks to the topology.
During the past 15 years, the total bandwidth per router has
increased by nearly three orders of magnitude, due to a combina-
tion of higher pin density and faster signaling rates, while typi-
cal packet sizes have remained roughly constant. This increase in
router bandwidth relative to packet size motivates networks built
from many thin links rather than fewer fat links as in the recent
past[13]. Building a network using high-radix routers with many
narrow ports reduces the latency and cost of the resulting network.
The design of the YARC1 router and BlackWidow network
make several contributions to the field of interconnection network
design:
• The BlackWidow topology extends the folded-Clos topology
to include sidelinks, which allow the global network band-
width to be statically partitioned among the peer subtrees,
reducing the cost and the latency of the network.
• The YARC microarchitecture is adapted to the constraints
imposed by modern ASIC technology — abundant wiring
but limited buffers. The abundant wiring available in the
ASIC process enabled an 8× speedup in the column orga-
nization of the YARC switch, greatly simplifying global ar-
bitration. At the same time, wormhole flow control was used
internal to YARC because insufficient buffers were available
to support virtual cut-through flow control.
• YARC provides fault tolerance by using a unique routing ta-
ble structure to configure the network to avoid bad links and
nodes. YARC also provides link-level retry and automatically
reconfigures to reduce channel width to avoid a faulty bit or
bits.
1YARC stands for ’Yet Another Router Chip’, and is also Cray spelled
backwards.
• YARC employs both an adaptive routing method and a deter-
ministic routing method based on address hashing to balance
load across network channels and routers.
This paper describes the BlackWidow (BW) multiprocessor
network and the microarchitecture of YARC, the high-radix router
chip used in the BW network. The rest of the paper is organized
as follows. An overview of the BlackWidow network and the
high-radix Clos topology is described in Section 2. We provide
an overview of the microarchitecture of the YARC router used in
the BlackWidow network in Section 3. In Section 4, the commu-
nication stack of the router is discussed and the routing within the
BlackWidow network is described in Section 5. The implemen-
tation of the YARC router is described in Section 6. We provide
some discussions in Section 7 on key design points of the Black-
Widow network and the YARC router, and present conclusions in
Section 8.
2 The BlackWidow Network
2.1 System Overview
The Cray BlackWidow multiprocessor is the follow-on to the
Cray X1. It is a distributed shared memory multiprocessor built
with high performance, high bandwidth custom processors. The
processors support latency hiding, addressing and synchronization
features that facilitate scaling to large system sizes. Each Black-
Widow processor is implemented on a single chip and includes a
4-way-dispatch scalar core, 8 vector pipes, two levels of cache and
a set of ports to the local memory system.
The system provides a shared memory with global load/store
access. It is globally cache coherent, but each processor only
caches data from memory within its four-processor node. This
provides natural support for SMP applications on a single node,
and hierarchical (e.g., shmem or MPI on top of OpenMP) applica-
tions across the entire machine. Pure distributed memory applica-
tions (MPI, shmem, CAF, UPC) are of course also supported, and
expected to represent the bulk of the workload.
2.2 Topology and Packaging
The BlackWidow network uses YARC high-radix routers, each
of which has 64 ports that are three bits wide in each direction.
Each BW processor has four injection ports into the network (Fig-
ure 1), with each port connecting to a different network slice. Each
slice is a completely separate network with its own set of YARC
router chips. The discussion of the topology in this section focuses
on a single slice of the network.
The BlackWidow network scales up to 32K processors using
a variation on a folded-Clos or fat-tree network topology that can
be incrementally scaled. The BW system is packaged in mod-
ules, chassis, and cabinets. Each compute module contains eight
processors with four network ports each. A chassis holds eight
compute modules organized as two 32-processor rank 1 (R1) sub-
trees, and up to four R1 router modules (each of which provides
two network slices for one of the subtrees). Each R1 router module
Figure 1. The BlackWidow network building blocks are
32-processor local groups connected via two rank 1
router modules each with two YARC (Y) router chips.
contains two 64-port YARC router chips (Figure 1) providing 64
downlinks that are routed to the processor ports via a mid-plane,
and 64 uplinks (or sidelinks) that are routed to eight 96-pin cable
connectors that carry eight links each.2 Each cabinet holds two
chassis (128 processors) organized as four 32-processor R1 sub-
trees. Machines with up to 288 processors (nine R1 subtrees) can
be connected by directly cabling the R1 subtrees to one another us-
ing sidelinks as shown in Figures 2(a) and 2(b) to create a rank 1.5 (R1.5)
network.
To scale beyond 288 processors, the uplink cables from each
R1 subtree are connected to rank 2 (R2) routers. A rank 2/3 router
module (Figure 2c) packages four YARC router chips on an R2/R3
module. The four radix-64 YARC chips on the R2/R3 module are
each split into two radix-32 virtual routers (see Section 7.4). Log-
ically, each R2/R3 module has eight radix-32 routers providing
256 network links on 32 cable connectors. Up to 16 R2/R3 router
modules are packaged into a stand-alone router cabinet.
Machines of up to 1024 processors can be constructed by con-
necting up to 32 32-processor R1 subtrees to R2 routers. Machines
of up to 4.5K processors can be constructed by connecting up to
9 512-processor R2 subtrees via side links. Up to 16K proces-
sors may be connected by a rank 3 (R3) network where up to 32
512-processor R2 subtrees are connected by R3 routers. In theory,
networks of up to 72K processors could be constructed by connect-
ing nine R3 subtrees via side links; however, the maximum-size
BW system is 32K processors.
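The scaling limits above follow a simple pattern, sketched below in Python. The assumption (not stated explicitly in the text) is that a non-top-rank radix-32 virtual router splits its ports evenly, 16 downlinks and 16 uplinks, which is consistent with the 512-processor R2 subtree and all four machine sizes quoted above.

```python
# Sketch of the BlackWidow scaling limits. Assumption: a radix-32 virtual
# router that still has uplinks splits its ports 16 down / 16 up; a top-rank
# router spends all 32 ports on downlinks. Rank-1 subtrees hold 32 processors.
R1_SUBTREE = 32

def subtree_size(rank):
    """Processors under a rank-`rank` subtree that retains uplinks."""
    return R1_SUBTREE * 16 ** (rank - 1)

def max_machine(top_rank, sidelinks=False):
    """Largest machine topped by rank-`top_rank` routers (or their sidelinks)."""
    if sidelinks:
        # up to nine peer subtrees directly cabled to one another
        return 9 * subtree_size(top_rank)
    # top routers use all 32 ports as downlinks to rank (top_rank - 1) subtrees
    return 32 * subtree_size(top_rank - 1)

print(max_machine(2))                  # 1024
print(max_machine(2, sidelinks=True))  # 4608  (the ~4.5K figure)
print(max_machine(3))                  # 16384 (16K)
print(max_machine(3, sidelinks=True))  # 73728 (the ~72K figure)
```

With `top_rank=1` and sidelinks, the same rule reproduces the 288-processor R1.5 configuration.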
The BW topology and packaging scheme enables very flexible
provisioning of network bandwidth. For instance, by only using
2Each network cable carries eight links to save cost and mitigate cable
bulk.
Figure 2. The BlackWidow network scales up to 32K pro-
cessors. Each rank 1 (R1) router module connects 32
BW processors and the rank 2/3 (R2/R3) modules con-
nect multiple R1 subtrees.
Figure 3. YARC router microarchitectural block diagram.
YARC is divided into an 8×8 array of tiles, where each tile
contains an input queue, row buffers, column buffers,
and an 8×8 subswitch.
a single rank 1 router module (instead of two as shown in Fig-
ure 1), the port bandwidth of each processor is cut in half —
halving both the cost of the network and its global bandwidth. An
additional bandwidth taper can be achieved by connecting only a
subset of the rank 1 to rank 2 network cables, reducing cabling
cost and R2 router cost at the expense of the bandwidth taper.
3 YARC Microarchitecture
The input-queued crossbar organization often used in low-radix
routers does not scale efficiently to high radices because the arbi-
tration logic and wiring complexity both grow quadratically with
the number of inputs. To overcome this complexity, we use a hier-
archical organization similar to that proposed by [13]. As shown
in Figure 3, YARC is organized as an 8×8 array of tiles. Each
tile contains all of the logic and buffering associated with one
input port and one output port. Each tile also contains an 8×8
switch and associated buffers. Each tile’s switch accepts inputs
from eight row buses that are driven by the input ports in its row,
and drives separate output channels to the eight output ports in
its column. Using a tile-based microarchitecture facilitates imple-
mentation, since each tile is identical and produces a very regular
structure for replication and physical implementation in silicon.
The YARC microarchitecture is best understood by following a
packet through the router. A packet arrives in the input buffer of a
tile. When the packet reaches the head of the buffer a routing deci-
sion is made to select the output column for the packet. The packet
is then driven onto the row bus associated with the input port and
buffered in a row buffer at the input of the 8×8 switch at the junc-
Figure 4. YARC pipeline diagram shows the tile divided into three blocks: input queue, subswitch, and column buffers.
tion of the packet’s input row and output column. At this point the
routing decision must be refined to select a particular output port
within the output column. The switch then routes the packet to
the column channel associated with the selected output port. The
column channel delivers the packet to an output buffer (associated
with the input row) at the output port multiplexer. Packets in the
per-input-row output buffers arbitrate for access to the output port
and, when granted access, are switched onto the output port via
the multiplexer.
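The traversal just described can be summarized as a sequence of stops keyed by tile coordinates. The sketch below assumes the 64 ports are numbered 0–63 in row-major order over the 8×8 tile array; the paper does not specify the numbering.

```python
# Sketch of a packet's path through YARC's tiled crossbar. Assumption: port p
# lives in tile row p // 8 (row-major numbering, not given in the paper).
TILE_DIM = 8

def path(in_port, out_port):
    """List the buffering/switching stops between an input and output port."""
    row = in_port // TILE_DIM    # row bus driven by the input tile
    col = out_port // TILE_DIM   # column chosen by the first routing decision
    return [
        ("input buffer", in_port),
        ("row bus", row),
        ("row buffer @ subswitch", (row, col)),  # 8x8 subswitch at (row, col)
        ("column channel", out_port),            # dedicated wire per output
        ("column buffer @ output tile", (out_port, row)),
        ("output mux", out_port),
    ]

for stage, where in path(in_port=10, out_port=53):
    print(f"{stage}: {where}")
```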
There are three sets of buffers in YARC: input buffers, row
buffers, and column buffers. Each buffer is partitioned into two
virtual channels. One input buffer and 8 row buffers are associ-
ated with each input port. Thus, no arbitration is needed to allocate
these buffers — only flow control. Eight column buffers are asso-
ciated with each subswitch. Allocation of these column buffers
takes place at the same time the packet is switched.
Like the design of [13], output arbitration is performed in two
stages. The first stage of arbitration is done to gain access to the
output of the subswitch. A packet then competes with packets
from other tiles in the same column in the second stage of arbitra-
tion for access to the output port. Unlike the hierarchical cross-
bar in [13], the YARC router takes advantage of the abundant on-
chip wiring resources to run separate channels from each output of
each subswitch to the corresponding output port. This organiza-
tion places the column buffers in the output tiles rather than at the
output of the subswitches. Co-locating the eight column buffers
associated with a given output in a single tile simplifies global
output arbitration. With column buffers at the outputs of the sub-
switch, the requests/grants to/from the global arbiters would need
to be pipelined to account for wire delay which would complicate
the arbitration logic.
As shown in Figure 4, a packet traversing the YARC router
passes through 25 pipeline stages, which results in a zero-load la-
tency of 31.25ns. To simplify implementation of YARC, each ma-
jor block (input queue, subswitch, and column buffers) was de-
signed with both input and output registers. This approach simpli-
fied system timing at the expense of latency. During the design,
additional pipeline stages were inserted to pipeline the wire delay
associated with the row busses and the column channels.
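At the 800 MHz core clock, the quoted zero-load latency is simply the stage count times the 1.25 ns cycle time:

```python
# Zero-load latency check: 25 register-to-register stages at 800 MHz.
CLOCK_HZ = 800e6
PERIOD_NS = 1e9 / CLOCK_HZ   # 1.25 ns per stage
STAGES = 25
print(STAGES * PERIOD_NS)    # 31.25
```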
4 Communication Stack
This section describes the three layers of the communication
stack: network layer, data-link layer, and physical layer. We dis-
cuss the packet format, flow control across the network links, the
link control block (LCB) which implements the data-link layer,
and the serializer/deserializer (SerDes) at the physical layer.
4.1 Packet Format
The format of a packet within the BlackWidow network is
shown in Figure 5. Packets are divided into 24-bit phits for trans-
mission over internal YARC datapaths. These phits are further
serialized for transmission over 3-bit wide network channels. A
minimum packet contains 4 phits carrying 32 payload bits. Longer
packets are constructed by inserting additional payload phits (like
the third phit in the figure) before the tail phit. Two bits of each
phit, as well as all of the tail phit, are used by the data-link layer.
The head phit of the packet controls routing which will be de-
scribed in detail in Section 5. In addition to specifying the destina-
tion, this phit contains a v bit that specifies which virtual channel
to use, and three bits, h, a, and r, that control routing. If the r
bit is set, the packet will employ source routing. In this case, the
packet header will be accompanied by a routing vector that indi-
cates the path through the network as a list of ports to select the
output port at each hop. Source routed packets are used only for
maintenance operations such as reading and writing configuration
registers on the YARC. If the a bit is set, the packet will route
adaptively, otherwise it will route deterministically. If the h bit is
set, the deterministic routing algorithm employs the hash bits in
the second phit to select the output port.
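The interaction of the r, a, and h bits can be summarized as a small decision procedure. The precedence of r over a is an assumption; the paper describes each bit individually rather than as an ordered check.

```python
# Decision procedure implied by the head-phit control bits described above.
# Assumption: r takes precedence over a (the paper does not state an order).
def routing_mode(r, a, h):
    if r:
        return "source-routed"         # maintenance traffic only
    if a:
        return "adaptive"
    if h:
        return "deterministic-hashed"  # hash bits in the 2nd phit pick the port
    return "deterministic"

assert routing_mode(r=1, a=0, h=0) == "source-routed"
assert routing_mode(r=0, a=1, h=0) == "adaptive"
assert routing_mode(r=0, a=0, h=1) == "deterministic-hashed"
```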
4.2 Network Layer Flow Control
The allocation unit for flow control is a 24-bit phit — thus,
the phit is really the flit (flow control unit). The BlackWidow net-
work uses two virtual channels (VCs) [7], designated request (v=0)
and response (v=1) to avoid request-response deadlocks in the net-
work. Therefore, all buffer resources are allocated according to
the virtual channel bit in the head phit. Each input buffer is 256
phits and is sized to cover the round-trip latency across the net-
work channel. Virtual cut-through flow control [12] is used across
the network links.
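Per-VC credit-based allocation against the input buffer can be sketched as follows. The even 128-phit split between the two VCs is an assumption; the text gives only the 256-phit total per input buffer.

```python
# Credit-counter sketch for the per-VC flow control described above.
# Assumption: the 256-phit input buffer is split evenly between the two VCs.
class CreditedLink:
    def __init__(self, credits_per_vc=128):
        # vc 0 = request, vc 1 = response
        self.credits = {0: credits_per_vc, 1: credits_per_vc}

    def can_send(self, vc, phits):
        # virtual cut-through: the whole packet must fit downstream
        return self.credits[vc] >= phits

    def send(self, vc, phits):
        assert self.can_send(vc, phits), "would overflow the input buffer"
        self.credits[vc] -= phits

    def ack(self, vc, phits):
        # credits returned as the receiver drains phits
        self.credits[vc] += phits

link = CreditedLink()
link.send(0, 19)          # a 19-phit packet on the request VC
print(link.credits[0])    # 109
```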
Figure 5. Packet format of the BlackWidow network.
4.3 Data-link Layer Protocol
The YARC data-link layer protocol is implemented by the link
control block (LCB). The LCB receives phits from the router core
and injects them into the serializer logic where they are transmitted
over the physical medium. The primary function of the LCB is to
reliably transmit packets over the network links using a sliding
window go-back-N protocol. The send buffer storage and retry is
on a packet granularity.
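A go-back-N sender at packet granularity, as described above, can be sketched as follows; the window size, API, and cumulative-ack convention are illustrative rather than taken from the LCB design.

```python
import collections

# Go-back-N sender sketch at packet granularity, per the LCB description
# above. Window size and cumulative-ack convention are assumptions.
class GoBackNSender:
    def __init__(self, window=8):
        self.window = window
        self.unacked = collections.deque()  # (seq, packet) sent but not acked
        self.next_seq = 0

    def send(self, packet, tx):
        if len(self.unacked) >= self.window:
            return False                    # window full: stall the sender
        tx(self.next_seq, packet)
        self.unacked.append((self.next_seq, packet))
        self.next_seq += 1
        return True

    def on_ack(self, seq):                  # cumulative ack through `seq`
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()

    def on_error(self, tx):                 # go back N: resend everything unacked
        for seq, packet in self.unacked:
            tx(seq, packet)
```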
Each 24-bit phit dedicates two bits of sideband as a control
channel for the LCB to carry sequence numbers and status infor-
mation. The virtual channel acknowledgment status bits travel in
the LCB sideband. These VC acks are used to increment the per-
vc credit counters in the output port logic. The ok field in the EOP
phit indicates if the packet is healthy, encountered a transmission
error on the current link (transmit error), or was corrupted prior
to transmission (soft error). The YARC internal datapath uses the
CRC to detect soft errors in the pipeline data paths and static mem-
ories used for storage. Before transmitting a tail phit onto the net-
work link, the LCB will check the current CRC against the packet
contents to determine if a soft error has corrupted the packet. If
the packet is corrupted, it is marked as soft error, and a good CRC
is generated so that it is not detected by the receiver as a transmis-
sion error. The packet will continue to flow through the network
marked as a bad packet with a soft error and eventually be dis-
carded by the network interface at the destination processor.
The narrow links of a high-radix router increase the serializa-
tion latency required to squeeze a packet over a link. For example, a 32B
cache-line write results in a packet with 19 phits (6 header, 12 data,
and 1 EOP). Consequently, the LCB passes phits up to the higher-
level logic speculatively, prior to verifying the packet CRC, which
avoids store-and-forward serialization latency at each hop. How-
ever, this early forwarding complicates error handling: the router
must correctly handle a packet with a transmission error and
reclaim the space it occupies in the input queue at the receiver.
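A back-of-envelope check of the serialization cost for the 32B write example, using the per-port rate of 3 × 6.25 Gb/s:

```python
# Serialization cost for the 32B cache-line write example above:
# 19 phits of 24 bits each over a port running 3 lanes x 6.25 Gb/s.
PHIT_BITS = 24
LINK_GBPS = 3 * 6.25          # 18.75 Gb/s per direction
phits = 6 + 12 + 1            # header + data + EOP
wire_ns = phits * PHIT_BITS / LINK_GBPS
print(wire_ns)                # ~24.3 ns of wire time per hop
```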
Because a packet with a transmission error is speculatively
passed up to the router core and may have already flowed to the
next router by the time the tail phit is processed, the LCB and input
queue must prevent corrupting the router state. The LCB detects
packet CRC errors and marks the packet as transmit error with a
corrected CRC before handing the end-of-packet (EOP) phit up to
the router core. The LCB also monitors the packet length of the re-
ceived data stream and clips any packets that exceed the maximum
packet length, which is programmed into an LCB configuration
register. When a packet is clipped, an EOP phit is appended to
the truncated packet and it is marked as transmit error. On ei-
ther error, the LCB will enter error recovery mode and await the
retransmission.
The input queue in the router must be protected from overflow. If it
receives more phits than can be