Concentration inequalities and martingale inequalities: a survey

Fan Chung (University of California, San Diego, fan@ucsd.edu)
Linyuan Lu (University of South Carolina, lu@math.sc.edu)

Research supported in part by NSF Grants DMS 0100472 and ITR 0205061.

May 28, 2006
Abstract
We examine a number of generalized and extended versions of concentration inequalities and martingale inequalities. These inequalities are effective for analyzing processes with quite general conditions, as illustrated in an example of an infinite Polya process and webgraphs.
1 Introduction
One of the main tools in probabilistic analysis is the concentration inequality. Basically, concentration inequalities give a sharp prediction of the actual value of a random variable by bounding the error term (from the expected value) with an associated probability. The classical concentration inequalities, such as those for the binomial distribution, have the best possible error estimates with exponentially small probabilistic bounds. Such concentration inequalities usually require certain independence assumptions (i.e., the random variable can be decomposed as a sum of independent random variables).
When the independence assumptions do not hold, it is still desirable to have similar, albeit slightly weaker, inequalities at our disposal. One approach is the martingale method. If the random variable and the associated probability space can be organized into a chain of events with modified probability spaces, and if the incremental changes of the value of the event are "small", then martingale inequalities provide very good error estimates. The reader is referred to numerous textbooks [5, 17, 20] on this subject.
In the past few years, there has been a great deal of research in analyzing general random graph models for realistic massive graphs which have uneven degree distributions, such as the power law [1, 2, 3, 4, 6]. The usual concentration inequalities and martingale inequalities have often been found to be inadequate and in many cases not feasible. The reasons are multi-fold: due to the uneven degree distribution, the error bounds for the very large degrees offset the delicate
analysis in the sparse part of the graph. Moreover, in setting up the martingales, a uniform upper bound for the incremental changes is often too poor to be of any use. Furthermore, the graph is dynamically evolving, and therefore the probability space changes at each tick of time.
In spite of these difficulties, it is highly desirable to extend the classical concentration inequalities and martingale inequalities so that a rigorous analysis of random graphs with general degree distributions can be carried out. Indeed, in the course of studying general random graphs, a number of variations and generalizations of concentration inequalities and martingale inequalities have been scattered around the literature. It is the goal of this survey to put these extensions and generalizations together and to present a more complete picture. We will examine and compare these inequalities, and complete proofs will be given. Needless to say, this survey is far from complete, since all the work is quite recent and the selection is heavily influenced by our personal learning experience on this topic. Indeed, many of these inequalities have been included in our previous papers [9, 10, 11, 12].
In addition to numerous variations of the inequalities, we also include an example of an application to a generalization of Polya's urn problem. Due to the fundamental nature of these concentration inequalities and martingale inequalities, they may be useful for many other problems as well.
This paper is organized as follows:
1. Introduction: overview, recent developments, and summary.

2. Binomial distribution and its asymptotic behavior: the normalized binomial distribution and the Poisson distribution.

3. General Chernoff inequalities: sums of independent random variables in five different concentration inequalities.

4. More concentration inequalities: five more variations of the concentration inequalities.

5. Martingales and Azuma's inequality: basics for martingales and proofs of Azuma's inequality.

6. General martingale inequalities: four general versions of martingale inequalities, with proofs.

7. Supermartingales and submartingales: modifying the definitions of a martingale while still preserving the effectiveness of the martingale inequalities.

8. The decision tree and relaxed concentration inequalities: instead of the worst-case incremental bound (the Lipschitz condition), only certain "local" conditions are required.

9. A generalized Polya's urn problem: an application to an infinite Polya process using these general concentration and martingale inequalities.
For webgraphs generated by the preferential attachment scheme, the concentration for the power law degree distribution can be derived in a similar way.
2 The binomial distribution and its asymptotic behavior
Bernoulli trials, named after James Bernoulli, can be thought of as a sequence of coin flips. For some fixed value $p$, where $0 \le p \le 1$, each toss has probability $p$ of turning up "heads". Let $S_n$ denote the number of heads after $n$ tosses. We can write $S_n$ as a sum of independent random variables $X_i$ as follows:
$$S_n = X_1 + X_2 + \cdots + X_n,$$
where, for each $i$, the random variable $X_i$ satisfies
$$\Pr(X_i = 1) = p, \qquad \Pr(X_i = 0) = 1 - p. \tag{1}$$
A classical question is to determine the distribution of $S_n$. It is not too difficult to see that $S_n$ has the binomial distribution $B(n, p)$:
$$\Pr(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \text{for } k = 0, 1, 2, \ldots, n.$$
The expectation and variance of $B(n, p)$ are
$$E(S_n) = np, \qquad \mathrm{Var}(S_n) = np(1-p).$$
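These formulas are easy to check empirically. Below is a short Python sketch (our illustration, not from the original text; the parameters $n = 1000$, $p = 0.3$, and the number of runs are arbitrary choices) that simulates repeated Bernoulli trials and compares the sample mean and variance of $S_n$ with $np$ and $np(1-p)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, runs = 1000, 0.3, 20000   # arbitrary illustrative parameters

# Each row is one run of n Bernoulli(p) coin flips; S_n is the row sum.
flips = rng.random((runs, n)) < p
S = flips.sum(axis=1)

print("sample mean:", S.mean(), "  vs  np      =", n * p)
print("sample var: ", S.var(),  "  vs  np(1-p) =", n * p * (1 - p))
```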
To better understand the asymptotic behavior of the binomial distribution, we compare it with the normal distribution $N(\mu, \sigma)$, whose density function is given by
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$
where $\mu$ denotes the expectation and $\sigma^2$ is the variance.
The case $N(0, 1)$ is called the standard normal distribution, whose density function is given by
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty.$$
When $p$ is a constant, the limit of the binomial distribution, after scaling, is the standard normal distribution; this can be viewed as a special case of the Central Limit Theorem, sometimes called the DeMoivre-Laplace limit theorem [15].
[Figure 1: The binomial distribution $B(10000, 0.5)$; x-axis: value, y-axis: probability.]

[Figure 2: The standard normal distribution $N(0, 1)$; x-axis: value, y-axis: probability density.]
Theorem 1 The binomial distribution $B(n, p)$ for $S_n$, as defined in (1), satisfies, for two constants $a$ and $b$,
$$\lim_{n \to \infty} \Pr(a\sigma < S_n - np < b\sigma) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx,$$
where $\sigma = \sqrt{np(1-p)}$, provided $np(1-p) \to \infty$ as $n \to \infty$.
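To see Theorem 1 numerically, one can compare the exact binomial probability of such an interval with the Gaussian integral. A minimal Python sketch (ours; the choices of $n$, $p$, $a$, $b$ are arbitrary, and we ignore boundary effects at integer endpoints):

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 10000, 0.5                 # illustrative parameters
a, b = -1.0, 2.0                  # interval endpoints, in units of sigma
sigma = np.sqrt(n * p * (1 - p))

# Exact probability that a*sigma < S_n - n*p < b*sigma.
lo, hi = n * p + a * sigma, n * p + b * sigma
exact = binom.cdf(np.floor(hi), n, p) - binom.cdf(np.floor(lo), n, p)

# The limiting Gaussian integral in Theorem 1.
gauss = norm.cdf(b) - norm.cdf(a)

print(exact, gauss)   # the two agree to roughly two decimal places
```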
When $np$ is bounded above (by a constant), the above theorem is no longer true. For example, for $p = \frac{\lambda}{n}$, the limit distribution of $B(n, p)$ is the so-called Poisson distribution $P(\lambda)$:
$$\Pr(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda}, \quad \text{for } k = 0, 1, 2, \ldots.$$
The expectation and variance of the Poisson distribution $P(\lambda)$ are given by
$$E(X) = \lambda, \quad \text{and} \quad \mathrm{Var}(X) = \lambda.$$
Theorem 2 For $p = \frac{\lambda}{n}$, where $\lambda$ is a constant, the limit distribution of the binomial distribution $B(n, p)$ is the Poisson distribution $P(\lambda)$.
Proof: We consider
$$\lim_{n \to \infty} \Pr(S_n = k) = \lim_{n \to \infty} \binom{n}{k} p^k (1-p)^{n-k} = \lim_{n \to \infty} \frac{\lambda^k \prod_{i=0}^{k-1} (1 - \frac{i}{n})}{k!}\, e^{-p(n-k)} = \frac{\lambda^k}{k!}\, e^{-\lambda}. \qquad \square$$
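The convergence in Theorem 2 is already visible for moderate $n$. A small Python sketch (ours; $\lambda = 3$ and $n = 1000$ are arbitrary illustrative choices) that compares the two probability mass functions:

```python
from scipy.stats import binom, poisson

lam, n = 3.0, 1000                 # illustrative parameters; p = lam / n
for k in range(8):
    b = binom.pmf(k, n, lam / n)   # binomial B(n, lam/n)
    q = poisson.pmf(k, lam)        # Poisson P(lam)
    print(f"k={k}:  binomial={b:.5f}  poisson={q:.5f}")
```

The printed values agree to about three decimal places, matching Figures 3 and 4.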
[Figure 3: The binomial distribution $B(1000, 0.003)$; x-axis: value, y-axis: probability.]

[Figure 4: The Poisson distribution $P(3)$; x-axis: value, y-axis: probability.]
As $p$ decreases from $\Theta(1)$ to $\Theta(\frac{1}{n})$, the asymptotic behavior of the binomial distribution $B(n, p)$ changes from the normal distribution to the Poisson distribution. (Some examples are illustrated in Figures 5 and 6.) Theorem 1 states that the asymptotic behavior of $B(n, p)$ within the interval $(np - C\sigma, np + C\sigma)$ (for any constant $C$) is close to the normal distribution. In some applications, we may need asymptotic estimates beyond this interval.
[Figure 5: The binomial distribution $B(1000, 0.1)$; x-axis: value, y-axis: probability.]

[Figure 6: The binomial distribution $B(1000, 0.01)$; x-axis: value, y-axis: probability.]
3 General Chernoff inequalities
If the random variable under consideration can be expressed as a sum of independent variables, it is possible to derive good estimates. The binomial distribution is one such example, where $S_n = \sum_{i=1}^n X_i$ and the $X_i$'s are independent and identically distributed. In this section, we consider sums of independent variables that are not necessarily identically distributed. To control the probability of how close a sum of random variables is to its expected value, various concentration inequalities come into play.
A typical version of the Cherno� inequalities, attributed to Herman Cherno�,
can be stated as follows:
Theorem 3 [8] Let X1; : : : ; Xn be independent random variables with E(Xi) =
0 and jXij � 1 for all i. Let X =
Pn
i=1 Xi and let �
2 be the variance of Xi.
Then
Pr(jX j � k�) � 2e−k2=4n;
for any 0 � k � 2�:
If the random variables $X_i$ under consideration assume non-negative values, the following version of the Chernoff inequalities is often useful.
Theorem 4 [8] Let $X_1, \ldots, X_n$ be independent random variables with
$$\Pr(X_i = 1) = p_i, \qquad \Pr(X_i = 0) = 1 - p_i.$$
We consider the sum $X = \sum_{i=1}^n X_i$, with expectation $E(X) = \sum_{i=1}^n p_i$. Then we have
$$\text{(Lower tail)} \qquad \Pr(X \le E(X) - \lambda) \le e^{-\lambda^2 / 2E(X)},$$
$$\text{(Upper tail)} \qquad \Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(E(X) + \lambda/3)}}.$$
We remark that the term $\lambda/3$ appearing in the exponent of the bound for the upper tail is significant. It covers the case when the limit distribution is Poisson as well as normal.
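To get a feel for these bounds, the following Python sketch (ours; the parameters are arbitrary) evaluates both tails of Theorem 4 against the exact binomial tail probabilities in the special case $p_i \equiv p$:

```python
import numpy as np
from scipy.stats import binom

n, p = 1000, 0.1                  # illustrative parameters
EX = n * p                        # E(X) = sum of the p_i
for lam in [10, 20, 40]:
    lower_exact = binom.cdf(EX - lam, n, p)          # Pr(X <= E(X) - lam)
    upper_exact = binom.sf(EX + lam - 1, n, p)       # Pr(X >= E(X) + lam)
    lower_bound = np.exp(-lam**2 / (2 * EX))
    upper_bound = np.exp(-lam**2 / (2 * (EX + lam / 3)))
    print(f"lam={lam}: lower {lower_exact:.2e} <= {lower_bound:.2e}, "
          f"upper {upper_exact:.2e} <= {upper_bound:.2e}")
```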
There are many variations of the Chernoff inequalities. Due to the fundamental nature of these inequalities, we will state several versions and then prove the strongest version, from which all the other inequalities can be deduced. (See Figure 7 for a flowchart of these theorems.) In this section, we will prove Theorem 8 and deduce Theorems 6 and 5. Theorems 10 and 11 will be stated and proved in the next section. Theorems 9, 7, 13, and 14 on the lower tail can be deduced by reflecting $X$ to $-X$.
The following inequality is a generalization of the Chernoff inequalities for the binomial distribution:
Theorem 5 [9] Let $X_1, \ldots, X_n$ be independent random variables with
$$\Pr(X_i = 1) = p_i, \qquad \Pr(X_i = 0) = 1 - p_i.$$
For $X = \sum_{i=1}^n a_i X_i$ with $a_i > 0$, we have $E(X) = \sum_{i=1}^n a_i p_i$, and we define $\nu = \sum_{i=1}^n a_i^2 p_i$. Then we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\lambda^2 / 2\nu}, \tag{2}$$
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\nu + a\lambda/3)}}, \tag{3}$$
where $a = \max\{a_1, a_2, \ldots, a_n\}$.
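The quantities $E(X)$, $\nu$, and $a$ in Theorem 5 are straightforward to compute. A minimal Python sketch (ours; the weights and probabilities below are arbitrary illustrations) that evaluates the bounds (2) and (3) and checks them against simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.uniform(0.5, 3.0, size=n)     # arbitrary weights a_i > 0
p = rng.uniform(0.05, 0.2, size=n)    # arbitrary probabilities p_i

EX = (a * p).sum()                    # E(X) = sum a_i p_i
nu = (a ** 2 * p).sum()               # nu   = sum a_i^2 p_i
amax = a.max()                        # a    = max a_i

lam = 3 * np.sqrt(nu)                 # an arbitrary deviation to test
lower = np.exp(-lam**2 / (2 * nu))                      # bound (2)
upper = np.exp(-lam**2 / (2 * (nu + amax * lam / 3)))   # bound (3)

X = (a * (rng.random((20000, n)) < p)).sum(axis=1)      # 20000 samples of X
print("lower tail:", (X <= EX - lam).mean(), "<= bound", lower)
print("upper tail:", (X >= EX + lam).mean(), "<= bound", upper)
```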
[Figure 7: The flowchart for theorems on the sum of independent variables, showing how the upper-tail theorems (6, 5, 4, 10, 11) are deduced from Theorem 8 and the lower-tail theorems (7, 13, 14) from Theorem 9.]
To compare inequalities (2) and (3), we consider an example in Figure 8. The cumulative distribution is the function $\Pr(X > x)$. The dotted curve in Figure 8 illustrates the cumulative distribution of the binomial distribution $B(1000, 0.1)$, with the value ranging from 0 to 1 as $x$ goes from $-\infty$ to $\infty$. The solid curve at the lower-left corner is the bound $e^{-\lambda^2/2\nu}$ for the lower tail. The solid curve at the upper-right corner is the bound $1 - e^{-\frac{\lambda^2}{2(\nu + a\lambda/3)}}$ for the upper tail.
[Figure 8: Chernoff inequalities; x-axis: value, y-axis: cumulative probability.]
The inequality (3) in the above theorem is a corollary of the following general concentration inequality (see also Theorem 2.7 in the survey paper by McDiarmid [20]).
Theorem 6 [20] Let $X_i$ ($1 \le i \le n$) be independent random variables satisfying $X_i \le E(X_i) + M$, for $1 \le i \le n$. We consider the sum $X = \sum_{i=1}^n X_i$ with expectation $E(X) = \sum_{i=1}^n E(X_i)$ and variance $\mathrm{Var}(X) = \sum_{i=1}^n \mathrm{Var}(X_i)$. Then we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + M\lambda/3)}}.$$
In the other direction, we have the following inequality.
Theorem 7 If $X_1, X_2, \ldots, X_n$ are non-negative independent random variables, we have the following bound for the sum $X = \sum_{i=1}^n X_i$:
$$\Pr(X \le E(X) - \lambda) \le e^{-\frac{\lambda^2}{2 \sum_{i=1}^n E(X_i^2)}}.$$
A strengthened version of the above theorem is as follows:
Theorem 8 Suppose the $X_i$ are independent random variables satisfying $X_i \le M$, for $1 \le i \le n$. Let $X = \sum_{i=1}^n X_i$ and $\|X\| = \sqrt{\sum_{i=1}^n E(X_i^2)}$. Then we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\|X\|^2 + M\lambda/3)}}.$$
Replacing $X$ by $-X$ in the proof of Theorem 8, we have the following theorem for the lower tail.

Theorem 9 Let $X_i$ be independent random variables satisfying $X_i \ge -M$, for $1 \le i \le n$. Let $X = \sum_{i=1}^n X_i$ and $\|X\| = \sqrt{\sum_{i=1}^n E(X_i^2)}$. Then we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\frac{\lambda^2}{2(\|X\|^2 + M\lambda/3)}}.$$
Before we give the proof of Theorem 8, we will first show the implications of Theorems 8 and 9. Namely, we will show that the other concentration inequalities can be derived from Theorems 8 and 9.
Fact: Theorem 8 $\Longrightarrow$ Theorem 6.

Proof: Let $X_i' = X_i - E(X_i)$ and $X' = \sum_{i=1}^n X_i' = X - E(X)$. We have
$$X_i' \le M \quad \text{for } 1 \le i \le n.$$
We also have
$$\|X'\|^2 = \sum_{i=1}^n E(X_i'^2) = \sum_{i=1}^n \mathrm{Var}(X_i) = \mathrm{Var}(X).$$
Applying Theorem 8, we get
$$\Pr(X \ge E(X) + \lambda) = \Pr(X' \ge \lambda) \le e^{-\frac{\lambda^2}{2(\|X'\|^2 + M\lambda/3)}} \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + M\lambda/3)}}. \qquad \square$$
Fact: Theorem 9 $\Longrightarrow$ Theorem 7.

The proof is straightforward, by choosing $M = 0$.
Fact: Theorems 6 and 7 $\Longrightarrow$ Theorem 5.

Proof: We define $Y_i = a_i X_i$. Note that
$$\|Y\|^2 = \sum_{i=1}^n E(Y_i^2) = \sum_{i=1}^n a_i^2 p_i = \nu.$$
Inequality (2) follows from Theorem 7, since the $Y_i$'s are non-negative. For the other direction, we have
$$Y_i \le a_i \le a \le E(Y_i) + a.$$
Inequality (3) follows from Theorem 6. $\square$
Fact: Theorems 8 and 9 $\Longrightarrow$ Theorem 3.

The proof follows by choosing $Y = X - E(X)$ and $M = 1$, and applying Theorems 8 and 9 to $Y$.
Fact: Theorem 5 $\Longrightarrow$ Theorem 4.

The proof follows by choosing $a_1 = a_2 = \cdots = a_n = 1$.
Finally, we give the complete proof of Theorem 8 and thus finish the proofs for all the above theorems on the Chernoff inequalities.
Proof of Theorem 8: We consider
$$E(e^{tX}) = E(e^{t \sum_i X_i}) = \prod_{i=1}^n E(e^{tX_i}),$$
since the $X_i$'s are independent.
We define $g(y) = 2 \sum_{k=2}^{\infty} \frac{y^{k-2}}{k!} = \frac{2(e^y - 1 - y)}{y^2}$, and use the following facts about $g$:

• $g(0) = 1$.
• $g(y) \le 1$, for $y < 0$.
• $g(y)$ is monotone increasing, for $y \ge 0$.
• For $y < 3$, we have
$$g(y) = 2 \sum_{k=2}^{\infty} \frac{y^{k-2}}{k!} \le \sum_{k=2}^{\infty} \frac{y^{k-2}}{3^{k-2}} = \frac{1}{1 - y/3},$$
since $k! \ge 2 \cdot 3^{k-2}$.
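(As a quick numerical sanity check of these four facts, a side illustration of ours and not part of the proof, one can evaluate $g$ on a grid in Python:)

```python
import numpy as np

def g(y):
    # g(y) = 2(e^y - 1 - y)/y^2, extended by continuity with g(0) = 1.
    y = np.asarray(y, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        vals = 2 * (np.expm1(y) - y) / y ** 2
    return np.where(y == 0, 1.0, vals)

ys = np.linspace(-10, 2.99, 10001)
assert abs(g(0.0) - 1.0) < 1e-12                  # g(0) = 1
assert np.all(g(ys[ys < 0]) <= 1.0)               # g(y) <= 1 for y < 0
pos = ys[ys >= 0]
assert np.all(np.diff(g(pos)) >= -1e-12)          # increasing for y >= 0
assert np.all(g(ys) <= 1 / (1 - ys / 3) + 1e-9)   # g(y) <= 1/(1 - y/3)
print("all four facts hold on the sampled grid")
```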
Then we have
$$E(e^{tX}) = \prod_{i=1}^n E(e^{tX_i}) = \prod_{i=1}^n E\left(\sum_{k=0}^{\infty} \frac{t^k X_i^k}{k!}\right) = \prod_{i=1}^n \left(1 + tE(X_i) + \frac{1}{2} t^2 E\big(X_i^2\, g(tX_i)\big)\right) \le \prod_{i=1}^n \left(1 + tE(X_i) + \frac{1}{2} t^2 E(X_i^2)\, g(tM)\right) \le \prod_{i=1}^n e^{tE(X_i) + \frac{1}{2} t^2 E(X_i^2)\, g(tM)} = e^{tE(X) + \frac{1}{2} t^2 g(tM) \sum_{i=1}^n E(X_i^2)} = e^{tE(X) + \frac{1}{2} t^2 g(tM) \|X\|^2}.$$
Hence, for $t$ satisfying $tM < 3$, we have
$$\Pr(X \ge E(X) + \lambda) = \Pr(e^{tX} \ge e^{tE(X) + t\lambda}) \le e^{-tE(X) - t\lambda}\, E(e^{tX}) \le e^{-t\lambda + \frac{1}{2} t^2 g(tM) \|X\|^2} \le e^{-t\lambda + \frac{1}{2} t^2 \|X\|^2 \frac{1}{1 - tM/3}}.$$
To minimize the above expression, we choose $t = \frac{\lambda}{\|X\|^2 + M\lambda/3}$. Then $tM < 3$ and we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-t\lambda + \frac{1}{2} t^2 \|X\|^2 \frac{1}{1 - tM/3}} = e^{-\frac{\lambda^2}{2(\|X\|^2 + M\lambda/3)}}.$$
The proof is complete. $\square$
4 More concentration inequalities

Here we state several variations and extensions of the concentration inequalities in Theorem 8. We first consider the upper tail.
Theorem 10 Let $X_i$ denote independent random variables satisfying $X_i \le E(X_i) + a_i + M$, for $1 \le i \le n$. For $X = \sum_{i=1}^n X_i$, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + \sum_{i=1}^n a_i^2 + M\lambda/3)}}.$$
Proof: Let $X_i' = X_i - E(X_i) - a_i$ and $X' = \sum_{i=1}^n X_i'$. We have
$$X_i' \le M \quad \text{for } 1 \le i \le n,$$
and
$$X' - E(X') = \sum_{i=1}^n (X_i' - E(X_i')) = \sum_{i=1}^n (X_i' + a_i) = \sum_{i=1}^n (X_i - E(X_i)) = X - E(X).$$
Thus,
$$\|X'\|^2 = \sum_{i=1}^n E(X_i'^2) = \sum_{i=1}^n E\big((X_i - E(X_i) - a_i)^2\big) = \sum_{i=1}^n \Big(E\big((X_i - E(X_i))^2\big) + a_i^2\Big) = \mathrm{Var}(X) + \sum_{i=1}^n a_i^2.$$
By applying Theorem 8, the proof is finished. $\square$
Theorem 11 Suppose the $X_i$ are independent random variables satisfying $X_i \le E(X_i) + M_i$, for $1 \le i \le n$. We order the $X_i$'s so that the $M_i$ are in increasing order. Let $X = \sum_{i=1}^n X_i$. Then for any $1 \le k \le n$, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + \sum_{i=k}^n (M_i - M_k)^2 + M_k \lambda/3)}}.$$
Proof: For fixed $k$, we choose $M = M_k$ and
$$a_i = \begin{cases} 0 & \text{if } 1 \le i \le k, \\ M_i - M_k & \text{if } k \le i \le n. \end{cases}$$
We have
$$X_i - E(X_i) \le M_i \le a_i + M_k \quad \text{for } 1 \le i \le n,$$
and
$$\sum_{i=1}^n a_i^2 = \sum_{i=k}^n (M_i - M_k)^2.$$
Using Theorem 10, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + \sum_{i=k}^n (M_i - M_k)^2 + M_k \lambda/3)}}. \qquad \square$$
Example 12 Let $X_1, X_2, \ldots, X_n$ be independent random variables. For $1 \le i \le n-1$, suppose the $X_i$ follow the same distribution with
$$\Pr(X_i = 0) = 1 - p \quad \text{and} \quad \Pr(X_i = 1) = p,$$
and $X_n$ follows the distribution with
$$\Pr(X_n = 0) = 1 - p \quad \text{and} \quad \Pr(X_n = \sqrt{n}) = p.$$
Consider the sum $X = \sum_{i=1}^n X_i$.
We have
$$E(X) = \sum_{i=1}^n E(X_i) = (n-1)p + \sqrt{n}\, p,$$
and
$$\mathrm{Var}(X) = \sum_{i=1}^n \mathrm{Var}(X_i) = (n-1)p(1-p) + np(1-p) = (2n-1)p(1-p).$$
Applying Theorem 6 with $M = (1-p)\sqrt{n}$, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2((2n-1)p(1-p) + (1-p)\sqrt{n}\,\lambda/3)}}.$$
In particular, for constant $p \in (0, 1)$ and $\lambda = \Theta(n^{\frac{1}{2}+\epsilon})$, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\Theta(n^{\epsilon})}.$$
Now we apply Theorem 11 with $M_1 = \cdots = M_{n-1} = 1 - p$ and $M_n = \sqrt{n}(1 - p)$. Choosing $k = n - 1$, we have
$$\mathrm{Var}(X) + (M_n - M_{n-1})^2 = (2n-1)p(1-p) + (1-p)^2 (\sqrt{n} - 1)^2 \le (2n-1)p(1-p) + (1-p)^2 n \le (1 - p^2) n.$$
Thus,
$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2((1 - p^2)n + (1-p)\lambda/3)}}.$$
For constant $p \in (0, 1)$ and $\lambda = \Theta(n^{\frac{1}{2}+\epsilon})$, we have
$$\Pr(X \ge E(X) + \lambda) \le e^{-\Theta(n^{2\epsilon})}.$$
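To see the gap between the two estimates concretely, the following Python sketch (ours; $p = 0.3$, $\epsilon = 0.1$, and the values of $n$ are arbitrary choices) evaluates the exponents in the two bounds at $\lambda = n^{1/2 + \epsilon}$:

```python
import numpy as np

p, eps = 0.3, 0.1                      # arbitrary illustrative choices
for n in [10**4, 10**6, 10**8]:
    lam = n ** (0.5 + eps)
    var = (2 * n - 1) * p * (1 - p)    # Var(X) from the example
    # Theorem 6, with the single bound M = (1 - p) * sqrt(n):
    e6 = lam**2 / (2 * (var + (1 - p) * np.sqrt(n) * lam / 3))
    # Theorem 11 with k = n - 1, i.e., M_k = 1 - p:
    e11 = lam**2 / (2 * ((1 - p**2) * n + (1 - p) * lam / 3))
    print(f"n=1e{int(np.log10(n))}: exponent {e6:10.2f} (Thm 6) "
          f"vs {e11:10.2f} (Thm 11)")
```

The first exponent grows like $n^{\epsilon}$ while the second grows like $n^{2\epsilon}$, which is exactly the difference between the bounds $e^{-\Theta(n^{\epsilon})}$ and $e^{-\Theta(n^{2\epsilon})}$ above.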
From the above example, we note that Theorem 11 gives a significantly better bound than Theorem 6 when the random variables $X_i$ have very different upper bounds.

For completeness, we also list the corresponding theorems for the lower tails. (These can be derived by replacing $X$ by $-X$.)
Theorem 13 Let $X_i$ denote independent random variables satisfying $X_i \ge E(X_i) - a_i - M$, for $1 \le i \le n$. For $X = \sum_{i=1}^n X_i$, we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + \sum_{i=1}^n a_i^2 + M\lambda/3)}}.$$
Theorem 14 Let $X_i$ denote independent random variables satisfying $X_i \ge E(X_i) - M_i$, for $1 \le i \le n$. We order the $X_i$'s so that the $M_i$ are in increasing order. Let $X = \sum_{i=1}^n X_i$. Then for any $1 \le k \le n$, we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\frac{\lambda^2}{2(\mathrm{Var}(X) + \sum_{i=k}^n (M_i - M_k)^2 + M_k \lambda/3)}}.$$
Continuing the above example, we choose $M_1 = M_2 = \cdots = M_{n-1} = p$ and $M_n = \sqrt{n}\, p$. We choose $k = n - 1$, so we have
$$\mathrm{Var}(X) + (M_n - M_{n-1})^2 = (2n-1)p(1-p) + p^2 (\sqrt{n} - 1)^2 \le (2n-1)p(1-p) + p^2 n \le p(2-p)n.$$
Using Theorem 14, we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\frac{\lambda^2}{2(p(2-p)n + p\lambda/3)}}.$$
For a constant $p \in (0, 1)$ and $\lambda = \Theta(n^{\frac{1}{2}+\epsilon})$, we have
$$\Pr(X \le E(X) - \lambda) \le e^{-\Theta(n^{2\epsilon})}.$$
5 Martingales and Azuma's inequality

A martingale is a sequence of random variables $X_0, X_1, \ldots$ with finite means such that the conditional expectation of $X_{n+1}$ given $X_0, X_1, \ldots, X_n$ is equal to $X_n$.
The above definition is given in the classical book of Feller [15], p. 210. However, the conditional expectation depends on the random variables under consideration and can be difficult to deal with in various cases. In this survey we will use the following definition, which is concise and basically equivalent for the finite cases.
Suppose that $\Omega$ is a probability space with a probability distribution $p$. Let $\mathcal{F}$ denote a $\sigma$-field on $\Omega$. (A $\sigma$-field on $\Omega$ is a collection of subsets of $\Omega$ which contains $\emptyset$ and $\Omega$, and is closed under unions, intersections, and complementation.) In a $\sigma$-field $\mathcal{F}$ on $\Omega$, the smallest set in $\mathcal{F}$ containing an element $x$ is the intersection of all sets in $\mathcal{F}$ containing $x$. A function $f : \Omega \to \mathbb{R}$ is said to be $\mathcal{F}$-measurable if $f(x) = f(y)$ for any $y$ in the smallest set containing $x$. (For more terminology on martingales, the reader is referred to [17].)
If $f : \Omega \to \mathbb{R}$ is a function, we define the expectation $E(f) = E(f(x) \mid x \in \Omega)$ by
$$E(f) = E(f(x) \mid x \in \Omega) := \sum_{x \in \Omega} f(x)\, p(x).$$
If $\mathcal{F}$ is a $\sigma$-field on $\Omega$, we define the conditional expectation $E(f \mid \mathcal{F}) : \Omega \to \mathbb{R}$ by the formula
$$E(f \mid \mathcal{F})(x) := \frac{1}{\sum_{y \in \mathcal{F}(x)} p(y)} \sum_{y \in \mathcal{F}(x)} f(y)\, p(y),$$
where $\mathcal{F}(x)$ is the smallest set in $\mathcal{F}$ containing $x$.
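For a finite $\Omega$, $E(f \mid \mathcal{F})$ simply averages $f$ over the atom (smallest set) containing each point. A minimal Python sketch (ours; the space, distribution, and partition below are arbitrary illustrations) for a $\sigma$-field generated by a partition:

```python
# E(f | F) on a finite probability space, where the sigma-field F is
# generated by a partition of Omega into atoms (smallest sets).
Omega = ["a", "b", "c", "d"]
p = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}    # probability distribution
f = {"a": 1.0, "b": 3.0, "c": 5.0, "d": 7.0}    # a function f: Omega -> R
atoms = [{"a", "b"}, {"c", "d"}]                # partition generating F

def cond_exp(x):
    # F(x) is the atom containing x; average f over it with weights p.
    atom = next(A for A in atoms if x in A)
    total = sum(p[y] for y in atom)
    return sum(f[y] * p[y] for y in atom) / total

for x in Omega:
    print(x, cond_exp(x))

# E(f | F) is constant on each atom, hence F-measurable, and its
# expectation equals E(f):
assert abs(sum(cond_exp(x) * p[x] for x in Omega)
           - sum(f[x] * p[x] for x in Omega)) < 1e-12
```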