SplitStream: High-Bandwidth Multicast in Cooperative
Environments
Miguel Castro1 Peter Druschel2 Anne-Marie Kermarrec1 Animesh Nandi2
Antony Rowstron1 Atul Singh2
1Microsoft Research, 7 J J Thomson Avenue, Cambridge, CB3 0FB, UK.
2Rice University, 6100 Main Street, MS-132, Houston, TX 77005, USA∗.
ABSTRACT
In tree-based multicast systems, a relatively small number of
interior nodes carry the load of forwarding multicast mes-
sages. This works well when the interior nodes are highly-
available, dedicated infrastructure routers but it poses a prob-
lem for application-level multicast in peer-to-peer systems.
SplitStream addresses this problem by striping the content
across a forest of interior-node-disjoint multicast trees that
distributes the forwarding load among all participating peers.
For example, it is possible to construct efficient SplitStream
forests in which each peer contributes only as much forward-
ing bandwidth as it receives. Furthermore, with appropri-
ate content encodings, SplitStream is highly robust to fail-
ures because a node failure causes the loss of a single stripe
on average. We present the design and implementation of
SplitStream and show experimental results obtained on an
Internet testbed and via large-scale network simulation. The
results show that SplitStream distributes the forwarding load
among all peers and can accommodate peers with different
bandwidth capacities while imposing low overhead for forest
construction and maintenance.
Categories and Subject Descriptors
C.2.4 [Computer-Communications networks]: Distributed
Systems—Distributed applications; C.2.2 [Computer-Commu-
nications networks]: Network Protocols—Applications, Rout-
ing protocols; D.4.5 [Operating Systems]: Reliability—Fault-
tolerance; D.4.8 [Operating Systems]: Performance
General Terms
Algorithms, Measurement, Performance, Reliability, Exper-
imentation
∗Supported in part by NSF (ANI-0225660) and by a Texas
ATP (003604-0079-2001) grant.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.
Keywords
Peer-to-peer, application-level multicast, end-system multi-
cast, content distribution, video streaming
1. INTRODUCTION
End-system or application-level multicast [8, 16, 22, 19,
40, 32, 13, 6, 23] has become an attractive alternative to IP
multicast. Instead of relying on a multicast infrastructure in
the network (which is not widely available), the participat-
ing hosts route and distribute multicast messages using only
unicast network services. In this paper, we are particularly
concerned with application-level multicast in peer-to-peer
(p2p) or cooperative environments where peers contribute
resources in exchange for using the service.
Unfortunately, conventional tree-based multicast is inher-
ently not well matched to a cooperative environment. The
reason is that in any multicast tree, the burden of duplicat-
ing and forwarding multicast traffic is carried by the small
subset of the peers that are interior nodes in the tree. The
majority of peers are leaf nodes and contribute no resources.
This conflicts with the expectation that all peers should
share the forwarding load. The problem is further aggra-
vated in high-bandwidth applications, like video or bulk file
distribution, where many peers may not have the capacity
and availability required of an interior node in a conventional
multicast tree. SplitStream addresses these problems by en-
abling efficient cooperative distribution of high-bandwidth
content in a peer-to-peer system.
The key idea in SplitStream is to split the content into k
stripes and to multicast each stripe using a separate tree.
Peers join as many trees as there are stripes they wish to
receive and they specify an upper bound on the number of
stripes that they are willing to forward. The challenge is to
construct this forest of multicast trees such that an interior
node in one tree is a leaf node in all the remaining trees and
the bandwidth constraints specified by the nodes are satis-
fied. This ensures that the forwarding load can be spread
across all participating peers. For example, if all nodes wish
to receive k stripes and they are willing to forward k stripes,
SplitStream will construct a forest such that the forwarding
load is evenly balanced across all nodes while achieving low
delay and link stress across the system.
Striping across multiple trees also increases the resilience
to node failures. SplitStream offers improved robustness to
node failure and sudden node departures like other systems
that exploit path diversity in overlays [4, 35, 3, 30]. Split-
Stream ensures that the vast majority of nodes are interior
nodes in only one tree. Therefore, the failure of a single
node causes the temporary loss of at most one of the stripes
(on average). With appropriate data encodings, applica-
tions can mask or mitigate the effects of node failures even
while the affected tree is being repaired. For example, ap-
plications can use erasure coding of bulk data [9] or multiple
description coding (MDC) of streaming media [27, 4, 5, 30].
The key challenge in the design of SplitStream is to con-
struct a forest of multicast trees that distributes the for-
warding load subject to the bandwidth constraints of the
participating nodes in a decentralized, scalable, efficient and
self-organizing manner. SplitStream relies on a structured
peer-to-peer overlay [31, 36, 39, 33] to construct and main-
tain these trees. We implemented a SplitStream prototype
and evaluated its performance. We show experimental re-
sults obtained on the PlanetLab [1] Internet testbed and
on a large-scale network simulator. The results show that
SplitStream achieves these goals.
The rest of this paper is organized as follows. Section 2
outlines the SplitStream approach in more detail. A brief de-
scription of the structured overlay is given in Section 3. We
present the design of SplitStream in Section 4. The results
of our experimental evaluation are presented in Section 5.
Section 6 describes related work and Section 7 concludes.
2. THE SPLITSTREAM APPROACH
In this section, we give a detailed overview of SplitStream’s
approach to cooperative, high-bandwidth content distribu-
tion.
2.1 Tree-based multicast
In all multicast systems based on a single tree, a partici-
pating peer is either an interior node or a leaf node in the
tree. The interior nodes carry all the burden of forward-
ing multicast messages. In a balanced tree with fanout f
and height h, the number of interior nodes is f
h−1
f−1 and the
number of leaf nodes is fh. Thus, the fraction of leaf nodes
increases with f . For example, more than half of the peers
are leaves in a binary tree, and over 90% of peers are leaves
in a tree with fanout 16. In the latter case, the forwarding
load is carried by less than 10% of the peers. All nodes have
equal inbound bandwidth, but the internal nodes have an
outbound bandwidth requirement of 16 times their inbound
bandwidth. Even in a binary tree, which would be impracti-
cally deep in most circumstances, the outbound bandwidth
required by the interior nodes is twice their inbound band-
width. Deep trees are not practical because they are more
fault prone and they introduce larger delays, which is a prob-
lem for some applications.
2.2 SplitStream
SplitStream is designed to overcome the inherently unbal-
anced forwarding load in conventional tree-based multicast
systems. SplitStream strives to distribute the forwarding
load over all participating peers and respects different ca-
pacity limits of individual peers. SplitStream achieves this
by splitting the multicast stream into multiple stripes, and
using separate multicast trees to distribute each stripe.
Figure 1 illustrates how SplitStream balances the forward-
ing load among the participating peers. In this simple ex-
ample, the original content is split into two stripes and mul-
2
3
5
8
stripe 1
64
7
1
Source
stripe 2
Figure 1: A simple example illustrating the basic
approach of SplitStream. The original content is
split into two stripes. An independent multicast tree
is constructed for each stripe such that a peer is an
interior node in one tree and a leaf in the other.
ticast in separate trees. For simplicity, let us assume that
the original content has a bandwidth requirement of B and
that each stripe has half the bandwidth requirement of the
original content. Each peer, other than the source, receives
both stripes inducing an inbound bandwidth requirement of
B. As shown in Figure 1, each peer is an internal node in
only one tree and forwards the stripe to two children, which
yields an outbound bandwidth of no more than B.
In general, the content is split into k stripes. Participat-
ing peers may receive a subset of the stripes, thus controlling
their inbound bandwidth requirement in increments of B/k.
Similarly, peers may control their outbound bandwidth re-
quirement in increments of B/k by limiting the number of
children they adopt. Thus, SplitStream can accommodate
nodes with different bandwidths, and nodes with unequal
inbound and outbound network capacities.
This works well when the bandwidth bottleneck in the
communication between two nodes is either at the sender
or at the receiver. While this assumption holds in many
settings, it is not universally true. If the bottleneck is else-
where, nodes may be unable to receive all desired stripes.
We plan to extend SplitStream to address this issue. For ex-
ample, nodes could monitor the packet arrival rate for each
stripe. If they detect that the incoming link for a stripe
is not delivering the expected bandwidth, they can detach
from the stripe tree and search for an alternate parent.
2.3 Applications
SplitStream provides a generic infrastructure for high-
bandwidth content distribution. Any application that uses
SplitStream controls how its content is encoded and divided
into stripes. SplitStream builds the multicast trees for the
stripes while respecting the inbound and outbound band-
width constraints of the peers. Applications need to encode
the content such that (i) each stripe requires approximately
the same bandwidth, and (ii) the content can be recon-
structed from any subset of the stripes of sufficient size.
In order for applications to tolerate the loss of a subset of
stripes, they may provide mechanisms to fetch content from
other peers in the system, or may choose to encode content
in a manner that requires greater than B/k per stripe, in
return for the ability to reconstitute the content from less
than k stripes. For example, a media stream could be en-
coded using MDC so that the video can be reconstituted
from any subset of the k stripes with video quality propor-
tional to the number of stripes received. Hence, if an interior
node in a stripe tree should fail then clients deprived of the
stripe are able to continue displaying the media stream at
reduced quality until the stripe tree is repaired. Such an en-
coding also allows low-bandwidth clients to receive the video
at lower quality by explicitly requesting less stripes.
Another example is the multicasting of file data with era-
sure coding [9]. Each data block is encoded using erasure
codes to generate k blocks such that only a (large) subset
of the k blocks is required to reconstitute the original block.
Each stripe is then used to multicast a different one of the k
blocks. Participants receive all stripes and once a sufficient
subset of the blocks is received the clients are able to recon-
stitute the original data block. If a client misses a number
of blocks from a particular stripe for a period of time (while
the stripe multicast tree is being repaired after an internal
node has failed) the client can still reconstitute the original
data blocks due to the redundancy. An interesting alter-
native is the use of rateless codes [24, 26], which provide
a simple approach to coordinating redundancy, both across
stripes and within each stripe.
Applications also control when to create and tear down a
SplitStream forest. Our experimental results indicate that
the maximum node stress to construct a forest and distribute
1 Mbyte of data is significantly lower than the node stress
placed on a centralized server distributing the same data.
Therefore, it is perfectly reasonable to create a forest to
distribute a few megabytes of data and then tear it down.
The results also show that the overhead to maintain a forest
is low even with high churn. Therefore, it is also reasonable
to create long-lived forests. The ideal strategy depends on
the fraction of time that a forest is used to transmit data.
2.4 Properties
Next, we discuss necessary and sufficient conditions for
the feasibility of forest construction by any algorithm and
relate them with what SplitStream can achieve.
Let N be the set of nodes and k be the number of stripes.
Each node i ∈ N wants to receive Ii (0 < Ii ≤ k) distinct
stripes and is willing to forward a stripe to up to Ci other
nodes. We call Ii the node’s desired indegree and Ci its
forwarding capacity. There is a set of source nodes (S ⊆ N)
whose elements originate one or more of the k stripes (i.e.,
1 ≤ |S| ≤ k). The forwarding capacity Cs of each source
node s ∈ S must at least equal the number of stripes that s
originates, Ts.
Definition 1. Given a set of nodes N and a set of sources
S ⊆ N , forest construction is feasible if it is possible to con-
nect the nodes such that each node i ∈ N receives Ii distinct
stripes and has no more than Ci children.
The following condition is obviously necessary for the fea-
sibility of forest construction by any algorithm.
Condition 1. If forest construction is feasible, the sum
of the desired indegrees cannot exceed the sum of the for-
warding capacities:
∑
∀i∈N
Ii ≤
∑
∀i∈N
Ci (1)
Condition 1 is necessary but not sufficient for the feasibil-
ity of forest construction, as the simple example in Figure 2
illustrates. The incoming arrows in each node in the figure
correspond to its desired indegree and the outgoing arrows
correspond to its forwarding capacity. The total forwarding
capacity matches the total desired indegree in this example
but it is impossible to supply both of the rightmost nodes
with two distinct stripes. The node with forwarding capac-
ity three has desired indegree one and, therefore, it can only
provide the same stripe to all its children.
Sources
Figure 2: An example illustrating that condition 1
is not sufficient to ensure feasibility of a SplitStream
forest.
Condition 2 prevents this problem. It is sufficient to en-
sure feasibility because it prevents the concentration of for-
warding capacity in nodes that are unable to forward all
stripes.
Condition 2. A sufficient condition for the feasibility of
forest construction is for Condition 1 to hold and for all
nodes whose forwarding capacity exceeds their desired inde-
gree to receive or originate all k stripes, i.e.,
∀i : Ci > Ii ⇒ Ii + Ti = k. (2)
This is a natural condition in a cooperative environment
because nodes are unlikely to spend more resources improv-
ing the quality of service perceived by others than on im-
proving the quality of service that they perceive. Addition-
ally, inbound bandwidth is typically greater than or equal
to outbound bandwidth in consumer Internet connections.
Given a set of nodes that satisfy Condition 2, the Split-
Stream algorithm can build a forest with very high proba-
bility provided there is a modest amount of spare capacity
in the system. The probability of success increases with
the minimum number of stripes that nodes receives, Imin,
and the total amount of spare capacity, C =
∑
∀i∈N Ci −∑
∀i∈N Ii. We derive the following rough upper bound on
the probability of failure in Section 4.5:
|N | × k × (1− Imin
k
)
C
k−1 (3)
As indicated by the upper bound formula, the probability
of success is very high even with a small amount of spare
capacity in the system. Additionally, we expect Imin to be
large for most applications. For example, erasure coding for
reliable distribution of data and MDC for video distribu-
tion perform poorly if peers do not receive to most stripes.
Therefore, we expect configurations where all peers receive
all stripes to be common. In this case, the algorithm can
guarantee efficient forest construction with probability one
even if there is no spare capacity.
In an open cooperative environment, it is important to
address the issue of free loaders, which appear to be preva-
lent in Gnutella [2]. In such an environment, it is desirable
to strengthen Condition 1 to require that the forwarding ca-
pacity of each node be greater than or equal to its desired
indegree (i.e., ∀i ∈ N : Ci ≥ Ii). (This condition may be un-
necessarily strong in more controlled settings, for example,
in a corporate intranet.) Additionally, we need a mecha-
nism to discourage free loading such that most participants
satisfy the stronger condition. In some settings, it may be
sufficient to have the SplitStream implementation enforce
the condition in the local node. Stronger mechanisms may
use a trusted execution platform like Microsoft’s Palladium
or a mechanism based on incentives [28]. This is an inter-
esting area of future work.
3. BACKGROUND
SplitStream is implemented using tree-based application-
level multicast. There are many proposals for how application-
level multicast trees can be built and maintained [8, 19, 40,
13, 32, 22, 6, 23]. In this paper, we consider the implemen-
tation of SplitStream using Scribe [13] and Pastry [33]. It
could also be implemented using a different overlay protocol
and group communication system; for example, Bayeux on
Tapestry [39, 40] or Scribe on CAN [15]. Before describ-
ing the SplitStream design, we provide a brief overview of
Pastry and Scribe.
3.1 Pastry
Pastry is a scalable, self-organizing structured peer-to-
peer overlay network similar to CAN [31], Chord [36], and
Tapestry [39]. In Pastry, nodes and objects are assigned ran-
dom identifiers (called nodeIds and keys, respectively) from
a large id space. NodeIds and keys are 128 bits long and can
be thought of as a sequence of digits in base 2b (b is a con-
figuration parameter with a typical value of 3 or 4). Given
a message and a key, Pastry routes the message to the node
with the nodeId that is numerically closest to the key, which
is called the key’s root. This simple capability can be used
to build higher-level services like a distributed hash table
(DHT) or an application-level group communication system
like Scribe.
In order to route messages, each node maintains a routing
table and a leaf set. A node’s routing table has about log2bN
rows and 2b columns. The entries in row r of the routing
table refer to nodes whose nodeIds share the first r digits
with the local node’s nodeId. The (r + 1)th nodeId digit of
a node in column c of row r equals c. The column in row
r corresponding to the value of the (r + 1)th digit of the
local node’s nodeId remains empty. At each routing step, a
node normally forwards the message to a node whose nodeId
shares with the key a prefix that is at least one digit longer
than the prefix that the key shares with the present node’s
id. If no such node is known, the message is forwarded to a
node whose nodeId shares a prefix with the key as long as
the current node’s nodeId but is numerically closer. Figure 3
shows the path of an example message.
Each Pastry node maintains a set of neighboring nodes in
the nodeId space (called the leaf set), both to ensure reliable
messag
本文档为【SplitStream-sosp】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。