Accelerated Gradient Methods for Networked Optimization
Euhanna Ghadimi, Mikael Johansson and Iman Shames
Abstract—This paper explores the use of accelerated gradient methods in networked optimization. Optimal algorithm parameters and associated convergence rates are derived for distributed resource allocation and consensus problems, and the accelerated gradient algorithms are shown to outperform alternatives in the literature in practice. Since the optimal parameters for the accelerated gradient method depend on upper and lower bounds of the Hessian, we study how errors in these estimates influence the convergence rate of the algorithm. This analysis identifies, among other things, cases where erroneous estimates of the Hessian bounds cause the accelerated method to converge more slowly than the corresponding (non-accelerated) gradient method. An application to Internet congestion control illustrates these issues.
I. INTRODUCTION
Distributed optimization has recently attracted significant attention from several different research communities. Examples include the work on network utility maximization for resource allocation in communication networks [1], distributed coordination of multi-agent systems [2], and collaborative estimation and event detection in wireless sensor networks (WSNs) [3], among many others. The majority of these approaches
rely on the application of gradient or subgradient methods
to the dual formulation of the decision problem at hand
(cf. [1]). Although gradient methods are easy to implement
and require modest computations, they suffer from slow
convergence rates. In certain special cases, such as the cele-
brated development of distributed power control algorithms
for cellular phones [4], one can replace gradient methods
by fixed-point iterations and achieve improved convergence
rates. For other problems, such as average consensus [5],
a number of heuristic methods have been proposed that
improve the convergence rate of the standard method [6], [7].
However, we are not interested in techniques that apply only
in special cases. We would like to develop general-purpose
schemes that retain the simplicity and the applicability of the
gradient method while improving the convergence rates.
In the optimization literature, there are essentially two
ways of accelerating the convergence rate of the gradient
methods. One is to use higher-order methods, such as
Newton’s method [8]. Although distributed Newton methods
have recently been developed for special problem classes,
they impose large communication overhead. Another way to
improve convergence is to use multi-step methods [9], [8]. These methods rely only on gradient information (and can hence often be implemented based on local information) but use a history of past iterates to compute the next iterate.
The authors are with the ACCESS Linnaeus Center, Electrical En-
gineering, Royal Institute of Technology, Stockholm, Sweden. E-mails:
{euhanna, mikaelj, imansh}@ee.kth.se
This paper explores how multi-step methods can be used
in networked optimization. Our initial focus is to attempt to
accelerate the projected gradient method of [10]. We derive
optimal parameters for the algorithm and show how these
are directly related to the topology of the underlying com-
munication graph. Contrary to the gradient descent method, which needs only an upper bound on the Hessian to find a step size that ensures convergence, the multi-step method needs both upper and lower bounds on the Hessian.
Due to this fact, we formally analyze the convergence and
compute the convergence rate in the presence of estimation
errors in the Hessian bounds. Later, we illustrate how the
technique allows us to derive accelerated consensus iterations
and demonstrate improved convergence rates relative to other
acceleration schemes proposed in the literature [11], [6].
Finally, we implement the techniques that we have developed
to accelerate the network flow control algorithm described in
[12]. We discuss how uncertainties in estimating the bounds on the Hessian can affect the convergence rate of the algorithm.
The rest of the paper is organized as follows. In Section
II, we review distributed resource allocation and multi-step
gradient techniques. In Section III we present our accelerated
resource allocation algorithm with proofs of convergence
and performance comparison with the basic scaled gradient
method. In Section IV, a robustness analysis of the multi-step algorithm in the presence of uncertainty is presented. Section V is devoted to other applications and considers accelerated consensus and network flow control. Concluding remarks and an outlook on future work are presented in Section VI.
II. PRELIMINARIES
We consider constrained optimization problems on a net-
work of nodes. The network is modeled as a graph G (V ,E )
with vertices (nodes) in the set V = {1,2, ..,n} and pairs
of nodes as edges in the set E ⊆ V × V . We use Ni =
{ j|(i, j) ∈ E } to denote the set of neighbors of node i.
A. Resource Allocation Under Total Budget Constraint
Consider the following resource allocation problem
min_x ∑_{i=1}^{n} f_i(x_i)
subject to ∑_{i=1}^{n} x_i = x_tot   (1)

where ∑_{i=1}^{n} x_i = x_tot is called the budget constraint, and the f_i are real-valued, strictly convex, twice differentiable functions whose second derivatives satisfy

L_i ≤ f_i''(x_i) ≤ U_i,  i = 1, ..., n,   (2)
for known constants Ui ≥ Li > 0. A distributed resource
allocation mechanism that maintains the budget constraint
2011 American Control Conference
on O'Farrell Street, San Francisco, CA, USA
June 29 - July 01, 2011
978-1-4577-0081-1/11/$26.00 ©2011 AACC 1668
at all times was developed in [10], [13]. Nodes iteratively
exchange resources via the scaled gradient method
xk+1 = xk −W∇ f (xk) (3)
The weight matrix W needs to satisfy 1TW = 0, W1 = 0
for the total resource constraint to be always satisfied, and
needs to have the same sparsity pattern as the underlying
graph to ensure that nodes only exchange resources with
neighbors. One matrix that satisfies these constraints on W is
the Laplacian of the underlying graph L = A(G)A(G)^T [14]. Here, A(G) is the incidence matrix of G, with entries 1 when (i, j) ∈ E, −1 when (j, i) ∈ E, and 0 otherwise. To
ensure convergence of the iteration (3), the Laplacian has to
be appropriately scaled W =−δL for some δ > 0.
The Laplacian weights relevant to a specific node can be
determined using local topology information. Specifically,
[W]_{ij} =
  δ,       (i, j) ∈ E,
  −δ d_i,  i = j,
  0,       otherwise,   (4)

where d_i is the degree of node i. Following the notation
in [13], we call these constant weights, since all edges
are assigned a constant weight and then Wii is adjusted to
make sure that W1 = 0. Several heuristic weight choices are introduced in [13]. For the special case when U_i = 1 for all i ∈ V, these include: the maximum degree weights, where δ = 1/max_i d_i; the best constant weights, for which δ = 2/(λ_2(L) + λ_n(L)); and the Metropolis-Hastings weights
[W]_{ij} =
  −min{1/d_i, 1/d_j},   (i, j) ∈ E,
  −∑_{(i,k)∈E} W_{ik},  i = j,
  0,                    otherwise.   (5)
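As a concrete illustration, the weight choices above can be sketched numerically. This is a minimal sketch under our own assumptions; the example graph and all helper names are illustrative and not part of [13]:

```python
import numpy as np

def laplacian(edges, n):
    # Graph Laplacian L = D - A of an undirected graph.
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1.0; L[j, i] -= 1.0
        L[i, i] += 1.0; L[j, j] += 1.0
    return L

def best_constant_weights(edges, n):
    # Scaled Laplacian weights with delta = 2/(lambda_2(L) + lambda_n(L)).
    L = laplacian(edges, n)
    lam = np.sort(np.linalg.eigvalsh(L))
    return (2.0 / (lam[1] + lam[-1])) * L

def metropolis_hastings_weights(edges, n):
    # Off-diagonal entries -min{1/d_i, 1/d_j} as in (5); the diagonal
    # is then chosen so that every row sums to zero (W1 = 0).
    d = np.zeros(n)
    for i, j in edges:
        d[i] += 1; d[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        w = min(1.0 / d[i], 1.0 / d[j])
        W[i, j] = W[j, i] = -w
    np.fill_diagonal(W, -W.sum(axis=1))
    return W

# A small 4-node ring as an example graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
W = metropolis_hastings_weights(edges, 4)
B = best_constant_weights(edges, 4)
```

Both constructions yield symmetric, positive semidefinite matrices with zero row sums, so the budget constraint is preserved by the exchange iteration.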
We will return to this method in Section III and see how the accelerated gradient techniques described next allow us to achieve improved convergence rates compared to (3).
B. Multi-step Gradient Methods
The classical gradient method for minimization of a con-
vex function f takes the form
xk+1 = xk −α∇ f (xk) (6)
for some step size parameter α > 0. This method is a suitable
choice for distributed optimization due to its simplicity and
ease of implementation. However, gradient based algorithms
often exhibit slow convergence rate [9]. The convergence
rate can be improved by accounting for the history of the
process which is already obtained in the preceding iterations.
Methods in which the new approximation depends on the
preceding ones are called multi-step methods. In this paper,
we use two-step iterations of the form
xk+1 = xk −α∇ f (xk)+β (xk − xk−1) (7)
where α > 0, β ≥ 0 are fixed step sizes. This method, proposed for centralized applications by Polyak [9], is called the Heavy Ball (HB) method and can be tuned to have a smoother trajectory toward the local minimum point than the traditional gradient iteration [9]. The smoother trajectory translates to faster convergence rates. The reason why we
focus on this method is that it is computationally simple and
does not need higher order information which might not be
locally available in distributed applications.
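To make the comparison between (6) and (7) concrete, here is a minimal sketch of both iterations on a simple quadratic. The Hessian bounds L = 1, U = 100 and all variable names are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gradient_descent(grad, x0, alpha, iters):
    # Classical iteration (6): x_{k+1} = x_k - alpha * grad(x_k).
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def heavy_ball(grad, x0, alpha, beta, iters):
    # Two-step iteration (7): adds the momentum term beta * (x_k - x_{k-1}).
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Quadratic f(x) = 1/2 x^T H x with curvature between L = 1 and U = 100.
H = np.diag([1.0, 100.0])
grad = lambda x: H @ x
L, U = 1.0, 100.0
x0 = np.array([1.0, 1.0])

gd = gradient_descent(grad, x0, 2.0 / (L + U), 100)
hb = heavy_ball(grad, x0,
                4.0 / (np.sqrt(L) + np.sqrt(U)) ** 2,
                ((np.sqrt(U) - np.sqrt(L)) / (np.sqrt(U) + np.sqrt(L))) ** 2,
                100)
# After the same number of iterations, the heavy-ball iterate is
# much closer to the minimizer x* = 0 than the gradient iterate.
```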
III. ACCELERATED RESOURCE ALLOCATION
Our first contribution in this paper is to consider the
application of multi-step methods to the distributed resource
allocation problem (1). To this end, consider the iteration
xk+1 = xk −αW∇ f (xk)+β (xk − xk−1) (8)
Let x⋆ be an optimal point of f(x). Using the Taylor expansion

∇f(x_k) = ∇²f(x⋆)(x_k − x⋆) + O(‖x_k − x⋆‖²)

and letting z_{k+1} = (x_{k+1} − x⋆, x_k − x⋆), we can rewrite (8) as

z_{k+1} = A z_k + o(z_k)   (9)

where the 2n × 2n matrix A is given by

A = [ (1+β)I − αWH   −βI
             I          0 ],   H = ∇²f(x⋆)   (10)
Let L= λ1(H)≤ λ2(H)≤ ...≤ λn(H) =U be the eigenvalues
of H. In what follows we study the local convergence of the
iterative process described by (8).
Theorem 1: Let ω = WH, let λ_n(ω) be the largest eigenvalue of ω, and let x⋆ be a nonsingular minimum point of f(x), x ∈ R^n. Then, for

0 ≤ β < 1,   0 < α < 2(1+β)/λ_n(ω),   LI ⪯ ∇²f(x⋆) ⪯ UI,   (11)

the method (8) converges to x⋆ at the rate of a geometric progression:

‖x_{k+1} − x⋆‖ / ‖x_k − x⋆‖ ≤ q_1,   0 ≤ q_1 < 1.
Proof: For brevity we omit the proofs. The interested
reader may refer to [15].
The next theorem gives the optimal step sizes as well as the
optimum convergence rate of (8).
Theorem 2: The convergence rate of (8) is given by

q_1 = max{ √β,  |1+β−αλ_2(ω)| − √β,  |1+β−αλ_n(ω)| − √β }   (12)
The minimal value of q_1,

q_1⋆ = (√λ_n(ω) − √λ_2(ω)) / (√λ_n(ω) + √λ_2(ω)),   (13)

is obtained for the step sizes α = α⋆, β = β⋆ with

α⋆ = 4 / (√λ_n(ω) + √λ_2(ω))²,   β⋆ = ( (√λ_n(ω) − √λ_2(ω)) / (√λ_n(ω) + √λ_2(ω)) )²   (14)
It is known that the convergence rate of the non-
accelerated method is (cf. [9], [8], [13])
q_2 = (λ_n(ω) − λ_2(ω)) / (λ_n(ω) + λ_2(ω))   (15)
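The gap between (13) and (15) can be checked numerically. The eigenvalue values below are arbitrary assumptions for illustration:

```python
import numpy as np

def multistep_rate(lam2, lamn):
    # Optimal accelerated rate q1* from (13).
    s2, sn = np.sqrt(lam2), np.sqrt(lamn)
    return (sn - s2) / (sn + s2)

def gradient_rate(lam2, lamn):
    # Non-accelerated rate q2 from (15).
    return (lamn - lam2) / (lamn + lam2)

lam2, lamn = 1.0, 100.0          # kappa = lamn/lam2 = 100
q1 = multistep_rate(lam2, lamn)
q2 = gradient_rate(lam2, lamn)

# Iterations needed to shrink the error by a fixed factor scale as
# 1/log(1/q), so the speed-up factor is log(q1)/log(q2) ~ sqrt(kappa).
speedup = np.log(q1) / np.log(q2)
```

For κ = 100 the computed speed-up is close to √κ = 10, in line with the discussion that follows.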
Fig. 1. Convergence behavior of the HB and Xiao-Boyd methods on a randomly generated network, for all the heuristic weights (Metropolis-Hastings, maximum degree, and best constant). The plot shows the objective function values f(x(t)) − f⋆ versus the iteration number t.
One can verify that q_1 ≤ q_2, and that the improvement in convergence rate depends on the quantity κ = λ_n(ω)/λ_2(ω).
In particular, when κ is large, then the speed-up is roughly
proportional to
√
κ (cf. the analysis for the centralized heavy-
ball method in [9]). Large values of κ essentially appear for
two reasons: one is a large spread in the values Ui between
nodes, and the other is the topology of the underlying graph.
Assume for simplicity that Ui = 1 for all i, so that ω =W and
consider Laplacian weights. It is well-known from spectral
graph theory [14] that the complete graph with n vertices
has λ2(L ) = λn(L ) = n/(n− 1), so κ = 1 and there is
no real advantage of using the accelerated scheme. On the
other hand, for a ring network of n nodes, the eigenvalues of the Laplacian are 1 − cos(2πk/n) for k = 0, ..., n−1, which means that κ grows quickly with n, and the performance improvements of the accelerated methods can be substantial.
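This topology dependence is easy to verify with the (unnormalized) graph Laplacian; the helper names below are our own, and the unnormalized spectrum differs from the scaled one in the text by a constant factor that cancels in the ratio κ:

```python
import numpy as np

def ring_laplacian(n):
    # Laplacian of an n-node ring; eigenvalues are 2(1 - cos(2*pi*k/n)).
    L = 2.0 * np.eye(n)
    for i in range(n):
        L[i, (i + 1) % n] = L[(i + 1) % n, i] = -1.0
    return L

def complete_laplacian(n):
    # Laplacian of the complete graph: n*I - 11^T.
    return n * np.eye(n) - np.ones((n, n))

def kappa(L):
    # Ratio lambda_n / lambda_2 over the nonzero spectrum.
    lam = np.sort(np.linalg.eigvalsh(L))
    return lam[-1] / lam[1]

# kappa = 1 for the complete graph, but grows roughly like n^2 for the ring.
kc = kappa(complete_laplacian(10))
kr20, kr40 = kappa(ring_laplacian(20)), kappa(ring_laplacian(40))
```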
1) Numerical Examples: In this section we numeri-
cally compare the performance of the accelerated and non-
accelerated resource allocation method under various weight
matrices described in the previous section. To make a fair
comparison, we consider a random graph generated in a
similar way as in [13]. The network consists of 20 nodes, each of degree three. Edges are bidirectional, and the objective function at each node has the form

f_i(x_i) = (1/2) a_i (x_i − c_i)² + log[1 + exp(b_i (x_i − d_i))],  i = 1, ..., n.

The coefficients a_i, b_i, c_i, and d_i are drawn uniformly from the intervals [0,2], [−2,2], [−10,10], and [−10,10], respectively, and kept the same for all simulations. As initial values, we fix the sum of the variables to zero (i.e., ∑_{i=1}^n x_i = 0). The second derivatives of the functions f_i are positive, with lower and upper bounds given by l_i = a_i and u_i = a_i + b_i²/4, i = 1, ..., n. Using the techniques in [13], we set the
parameters δ = −0.1251 for maximum degree weight and
δ ⋆ =−0.2030 for best constant weight scheme. Fig. 1 shows
the objective value minus the optimal value as function of
iteration count. The plot shows that the accelerated gradient
method yields a significantly increased convergence rate.
IV. ROBUSTNESS ANALYSIS
When the upper and lower bounds on the Hessian are
known and their ratio is significant, multi-step methods give
a considerable increase in convergence rate over the standard
gradient iteration. However such upper and lower bounds
are sometimes hard to estimate accurately in practice. It is
therefore important to analyze the sensitivity of multi-step
methods to errors in the Hessian bounds to assess if the
performance benefits prevail if the bounds are inaccurate.
Such a robustness analysis will be performed next.
Let L and U be the true upper and lower bounds of the
Hessian of the objective function, and let Lˆ and Uˆ be the esti-
mated bounds used when tuning the gradient and accelerated
gradient methods. We would like to observe for which values
of Lˆ and Uˆ the two methods converge under their “optimal”
step-size rules and compare the associated convergence rates.
In [9] sufficient conditions for the convergence of gradient
iterates of smooth functions are given. According to [9,
Theorem 3], for fixed step size 0 < α < 2/U , the gradient
algorithm converges with rate
q_2 = max{ |1 − αL|, |1 − αU| } < 1.
The minimum value q⋆2 = (U−L)/(U +L) is attained by the
optimal step size α⋆ = 2/(L+U). Together with the analysis
in Theorem 1 this yields the following observation:
Proposition 1: Let 0 < L̂ < Û be erroneous estimates of the Hessian bounds L, U, respectively. For all values of L̂, Û fulfilling the condition U < Û + L̂, both the gradient iteration (6) with step size

α̂ = 2/(L̂ + Û)

and the HB algorithm (7) with parameters

α̂ = 4/(√L̂ + √Û)²,   β̂ = ( (√Û − √L̂)/(√Û + √L̂) )²

converge to the optimum of f(x).
According to this proposition, the convergence of both meth-
ods is rather similar. To compare convergence rates, we start
by presenting the following lemma.
Lemma 1: For parameters L̂, Û satisfying the convergence condition of Proposition 1 with L̂ + Û ≠ L + U, it holds that q̂_2 > q⋆_2 in either case, i.e., the gradient method with misestimated Hessian bounds has a slower convergence rate than the optimally tuned one. The best step size choice, perhaps unexpectedly, occurs when L̂ + Û = L + U, for which q̂_2 = q⋆_2.
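A quick numerical check of this behavior, with illustrative bound values of our own choosing:

```python
def gd_rate(L, U, Lh, Uh):
    # Rate of (6) when the step size is tuned with estimates (Lh, Uh):
    # alpha_hat = 2/(Lh + Uh), q = max{|1 - alpha*L|, |1 - alpha*U|}.
    a = 2.0 / (Lh + Uh)
    return max(abs(1.0 - a * L), abs(1.0 - a * U))

L, U = 1.0, 10.0
q_opt = (U - L) / (U + L)

exact = gd_rate(L, U, 1.0, 10.0)   # correct bounds: optimal rate
lucky = gd_rate(L, U, 3.0, 8.0)    # wrong bounds, but Lh + Uh = L + U
over  = gd_rate(L, U, 2.0, 14.0)   # Lh + Uh > L + U: slower than optimal
under = gd_rate(L, U, 0.5, 8.0)    # Lh + Uh < U: rate exceeds 1 (divergence)
```

Note that the rate depends on the estimates only through the sum L̂ + Û, which is why any estimates preserving that sum recover the optimal rate.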
On the other hand, Theorem 2 establishes the convergence
rate of HB with arbitrary (uncertain) step sizes to be
q̂_1 = max{ √β̂,  |1 + β̂ − α̂L| − √β̂,  |1 + β̂ − α̂U| − √β̂ }.   (17)
where αˆ and βˆ are the values of the optimal step size rule
when the erroneous Hessian bounds are used. It is interesting
to note that for L̂ = Û, under the convergence constraint L̂ + Û > U, both methods converge with rate 1 − L/L̂. However, when the Hessian bounds are not known, the Heavy Ball method can perform either better or worse than the gradient method. We have the following results:
Proposition 2: Assume parameters L̂ > L, Û > U. Then the convergence rate of the Heavy Ball method is q̂_1 = 1 + β̂ − α̂L − √β̂, which is faster than the gradient alternative.
Proposition 3: Assume parameters L̂ < L, Û > U with L̂ + Û = L + U. Then the convergence rate of the Heavy Ball method is given by q̂_1 = √β̂. Moreover, if √(Û/L̂) > U/L, then this rate is slower than the gradient alternative.
From our simulation experience, the situation described in
Proposition 3 is rather singular. In fact, we have not yet
experienced this situation in non-contrived scenarios.
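Both regimes can be reproduced by evaluating (17) against the misestimated gradient rate; the numeric bounds below are assumptions chosen only to satisfy each proposition's hypotheses:

```python
import numpy as np

def gd_rate(L, U, Lh, Uh):
    # Gradient rate with step size 2/(Lh + Uh).
    a = 2.0 / (Lh + Uh)
    return max(abs(1.0 - a * L), abs(1.0 - a * U))

def hb_rate(L, U, Lh, Uh):
    # Rate (17) with heavy-ball step sizes computed from the estimates.
    a = 4.0 / (np.sqrt(Lh) + np.sqrt(Uh)) ** 2
    b = ((np.sqrt(Uh) - np.sqrt(Lh)) / (np.sqrt(Uh) + np.sqrt(Lh))) ** 2
    sb = np.sqrt(b)
    return max(sb, abs(1 + b - a * L) - sb, abs(1 + b - a * U) - sb)

# Proposition 2: both bounds overestimated -> heavy ball still wins.
hb_over, gd_over = hb_rate(1.0, 10.0, 2.0, 20.0), gd_rate(1.0, 10.0, 2.0, 20.0)

# Proposition 3: Lh < L, Uh > U, Lh + Uh = L + U, sqrt(Uh/Lh) > U/L
# -> heavy ball is slower than the gradient method.
hb_mix, gd_mix = hb_rate(4.0, 7.0, 1.0, 10.0), gd_rate(4.0, 7.0, 1.0, 10.0)
```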
V. FURTHER APPLICATIONS
A. Accelerated Consensus
Distributed algorithms for consensus seeking, first proposed by Tsitsiklis et al. [5], have been heavily researched during the last few years. We consider consensus via linear iterations of the form
x_i(k+1) = W_{ii} x_i(k) + ∑_{j∈N_i} W_{ij} x_j(k),   i = 1, ..., n.   (18)
Here, Wi j is the weight on x j assigned in node i. In [16]
necessary and sufficient conditions on W for (18) to converge
to the average of initial values are given. More specifically, it
is shown that such matrices W have their largest eigenvalue
equal to 1 while their second largest eigenvalue is strictly
less than 1 and determines the asymptotic convergence factor
towards consensus [16]. For symmetric weights, [16] derived
a semi-definite program (SDP) for finding the weight matrix
W with maximized convergence rate and proposed several
simple heuristics for finding suboptimal weights, including
constant and Metropolis-Hastings weights (cf. [16], [2]). In
what follows we use dual decomposition to develop multi-
step consensus iterations with accelerated convergence.
1) Consensus Algorithm Using Dual Decomposition: Iterations similar to the primal consensus iterations can be obtained from networked minimization of quadratic functions:

min ∑_{i=1}^{n} (1/2)(x_i − c_i)²
subject to x_i = x_j,  ∀(i, j) ∈ E   (19)
Any distributed method for solving this problem is also a
distributed averaging (or average consensus) algorithm. A
primal-dual based algorithm combined with a subgradient
method to solve (19) is presented in [17], and an alternating direction multiplier method to cast the optimization problem in a distributed fashion is discussed in [18]. The iterations in the latter are shown to be resilient to communication noise.
One can re-write (19) in vector notation and apply Lagrange
duality to the coupling constraint to find the Lagrangian
L(x, µ) = (1/2)(x − c)^T (x − c) + µ^T A x   (20)
By the first-order optimality conditions for (20), we can define the dual problem as the following unconstrained minimization:

minimize −g(µ) = (1/2) µ^T A A^T µ − µ^T A c   (21)
For given Lagrange multiplier µ , the primal variable x will be
updated by minimizing the Lagrangian (20). The accelerated
gradient method hence suggests the multi-step iteration
µ_{k+1} = µ_k − α(A A^T µ_k − A c) + β(µ_k − µ_{k−1})
x_{k+1} = c − A^T µ_{k+1}   (22)
Note that the µ-iterations in the dual scheme have the same form as the distributed resource allocation iterations under Laplacian weight selection. Hence, our analysis and design
rules for selecting optimal algorithm parameters apply imme-
diately. To understand the relationship between this method
and alternative schemes in the literature, it is useful to try
to eliminate the µ-update and only consider the dynamics
of the primal variables. To this end, multiplying both sides of (22) by A^T yields

A^T µ_{k+1} = A^T µ_k − α A^T (A A^T µ_k − A c) + β A^T (µ_k − µ_{k−1})   (23)

Using A^T µ_k = c − x_k and letting W = A^T A gives

x_{k+1} = ((1+β)I − αW) x_k − β x_{k−1}   (24)
As argued previously, W is positive semidefinite with a
simple eigenvalue at 0 and fulfills W1 = 0, 1TW = 0. With
this definition, the consensus iterations coincide with the
general results of Theorem 2 (the Hessian is identity in (10)).
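A minimal sketch of the resulting consensus iteration (24) on a small example; the path graph and the initial values c are our own assumptions:

```python
import numpy as np

# Incidence matrix A of a 4-node path graph 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    A[e, i], A[e, j] = 1.0, -1.0

W = A.T @ A                      # weighted Laplacian: W1 = 0, PSD
c = np.array([4.0, 0.0, 2.0, 6.0])

# Optimal step sizes per Theorem 2 (the Hessian is the identity here,
# so omega = W and its nonzero extreme eigenvalues set alpha and beta).
lam = np.sort(np.linalg.eigvalsh(W))
l2, ln = lam[1], lam[-1]
alpha = 4.0 / (np.sqrt(ln) + np.sqrt(l2)) ** 2
beta = ((np.sqrt(ln) - np.sqrt(l2)) / (np.sqrt(ln) + np.sqrt(l2))) ** 2

# Iteration (24), starting from x_0 = x_{-1} = c (i.e., mu_0 = mu_{-1} = 0).
x_prev = c.copy()
x = c.copy()
for _ in range(200):
    x, x_prev = (1 + beta) * x - alpha * (W @ x) - beta * x_prev, x
# x converges to the average of c at every node.
```

Since 1^T W = 0, the iteration preserves the sum of the entries, so the common limit is exactly the average of the initial values.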
We compare this simple technique with two alternative
acceleration methods; one from the literature on accelerated
consensus, and the other one from the literature on (central-
ized) first-order techniques for convex optimization.
2) Accelerated Consensus via Shift Registers: Shift registers can be used to speed up convergence of the stochastic form of (18). In [6] it is demonstrated by simulations that the consen