Accelerated Gradient Methods for Networked Optimization

Euhanna Ghadimi, Mikael Johansson and Iman Shames
(The authors are with the ACCESS Linnaeus Center, Electrical Engineering, Royal Institute of Technology, Stockholm, Sweden. E-mails: {euhanna, mikaelj, imansh}@ee.kth.se)

Abstract—This paper explores the use of accelerated gradient methods in networked optimization. Optimal algorithm parameters and associated convergence rates are derived for distributed resource allocation and consensus problems, and the practical performance of the accelerated gradient algorithms is shown to outperform alternatives in the literature. Since the optimal parameters for the accelerated gradient method depend on upper and lower bounds of the Hessian, we study how errors in these estimates influence the convergence rate of the algorithm. This analysis identifies, among other things, cases where erroneous estimates of the Hessian bounds cause the accelerated method to converge more slowly than the corresponding (non-accelerated) gradient method. An application to Internet congestion control illustrates these issues.

I. INTRODUCTION

Distributed optimization has recently attracted significant attention from several different research communities. Examples include the work on network utility maximization for resource allocation in communication networks [1], distributed coordination of multi-agent systems [2], and collaborative estimation and event detection in wireless sensor networks (WSNs) [3], among many others. The majority of these approaches rely on the application of gradient or subgradient methods to the dual formulation of the decision problem at hand (cf. [1]). Although gradient methods are easy to implement and require only modest computations, they suffer from slow convergence rates. In certain special cases, such as the celebrated development of distributed power control algorithms for cellular phones [4], one can replace gradient methods by fixed-point iterations and achieve improved convergence rates. For other problems, such as average consensus [5], a number of heuristic methods have been proposed that improve the convergence rate of the standard method [6], [7]. However, we are not interested in techniques that apply only in special cases. We would like to develop general-purpose schemes that retain the simplicity and the applicability of the gradient method while improving the convergence rates.

In the optimization literature, there are essentially two ways of accelerating the convergence rate of gradient methods. One is to use higher-order methods, such as Newton's method [8]. Although distributed Newton methods have recently been developed for special problem classes, they impose a large communication overhead. Another way to improve convergence is to use multi-step methods [9], [8]. These methods rely only on gradient information (and can hence often be implemented based on local information) but use a history of past iterates to compute the next one.

This paper explores how multi-step methods can be used in networked optimization. Our initial focus is to accelerate the projected gradient method of [10]. We derive optimal parameters for the algorithm and show how these are directly related to the topology of the underlying communication graph. Contrary to the gradient descent method, which only needs an upper bound on the Hessian to find a step size that ensures convergence, the multi-step method needs both upper and lower bounds on the Hessian.
For this reason, we formally analyze the convergence and compute the convergence rate in the presence of estimation errors in the Hessian bounds. We then illustrate how the technique allows us to derive accelerated consensus iterations, and demonstrate improved convergence rates relative to other acceleration schemes proposed in the literature [11], [6]. Finally, we apply the techniques that we have developed to accelerate the network flow control algorithm described in [12], and discuss how uncertainties in the Hessian bounds affect the convergence rate of the algorithm.

The rest of the paper is organized as follows. In Section II, we review distributed resource allocation and multi-step gradient techniques. In Section III we present our accelerated resource allocation algorithm, with proofs of convergence and a performance comparison with the basic scaled gradient method. In Section IV, a robustness analysis of the multi-step algorithm in the presence of uncertainty is presented. Section V is devoted to further applications and considers accelerated consensus and network flow control. Concluding remarks and an outlook on future work are given in Section VI.

II. PRELIMINARIES

We consider constrained optimization problems on a network of nodes. The network is modeled as a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ with vertices (nodes) in the set $\mathcal{V} = \{1, 2, \dots, n\}$ and pairs of nodes as edges in the set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. We use $\mathcal{N}_i = \{\, j \mid (i,j) \in \mathcal{E} \,\}$ to denote the set of neighbors of node $i$.

A. Resource Allocation Under Total Budget Constraint

Consider the following resource allocation problem

$\min_x \; \sum_{i=1}^n f_i(x_i) \quad \text{subject to} \quad \sum_{i=1}^n x_i = x_{\mathrm{tot}}$   (1)

where $\sum_{i=1}^n x_i = x_{\mathrm{tot}}$ is called the budget constraint, and the $f_i$ are real-valued, strictly convex and twice differentiable functions whose second derivatives satisfy

$L_i \le f_i''(x_i) \le U_i, \quad i = 1, \dots, n,$   (2)

for known constants $U_i \ge L_i > 0$. A distributed resource allocation mechanism that maintains the budget constraint at all times was developed in [10], [13]: nodes iteratively exchange resources via the scaled gradient method

$x_{k+1} = x_k - W \nabla f(x_k)$   (3)

The weight matrix $W$ must satisfy $\mathbf{1}^T W = 0$, $W \mathbf{1} = 0$ for the total resource constraint to be satisfied at all times, and must have the same sparsity pattern as the underlying graph to ensure that nodes only exchange resources with neighbors. One matrix that satisfies these constraints on $W$ is the Laplacian of the underlying graph, $\mathcal{L} = A(\mathcal{G}) A(\mathcal{G})^T$ [14]. Here, $A(\mathcal{G})$ is the incidence matrix of $\mathcal{G}$, with entries $1$ when $(i,j) \in \mathcal{E}$, $-1$ when $(j,i) \in \mathcal{E}$, and $0$ otherwise. To ensure convergence of the iteration (3), the Laplacian has to be appropriately scaled, $W = -\delta \mathcal{L}$ for some $\delta > 0$. The Laplacian weights relevant to a specific node can be determined using local topology information. Specifically,

$[W]_{ij} = \begin{cases} \delta, & (i,j) \in \mathcal{E}, \\ -\delta d_i, & i = j, \\ 0, & \text{otherwise}, \end{cases}$   (4)

where $d_i$ is the degree of node $i$. Following the notation in [13], we call these constant weights, since all edges are assigned a constant weight and $W_{ii}$ is then adjusted to ensure $W\mathbf{1} = 0$. Several heuristic weight choices are introduced in [13]. For the special case when $U_i = 1$ for all $i$, these include: the maximum-degree weights, where $\delta = 1/\max_i d_i$; the best constant weights, for which $\delta = 2/(\lambda_2(\mathcal{L}) + \lambda_n(\mathcal{L}))$; and the Metropolis-Hastings weights

$[W]_{ij} = \begin{cases} -\min\{1/d_i, 1/d_j\}, & (i,j) \in \mathcal{E}, \\ -\sum_{(i,k)\in\mathcal{E}} W_{ik}, & i = j, \\ 0, & \text{otherwise}. \end{cases}$   (5)

We will return to this method in Section III and see how the accelerated gradient techniques described next allow us to achieve improved convergence rates compared to (3).
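To make the weight rules (4)-(5) concrete, the following minimal sketch (our own numpy-based illustration, not the authors' code; all function names are ours) builds the constant and Metropolis-Hastings weight matrices for an undirected graph given as an edge list, and checks the zero-row-sum property:

```python
import numpy as np

def degrees(n, edges):
    d = np.zeros(n)
    for i, j in edges:
        d[i] += 1.0
        d[j] += 1.0
    return d

def constant_weights(n, edges, delta):
    # Literal transcription of (4): [W]_ij = delta on edges, W_ii = -delta * d_i,
    # so W has the graph's sparsity pattern and satisfies W 1 = 1^T W = 0.
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = delta
        W[i, i] -= delta
        W[j, j] -= delta
    return W

def metropolis_hastings_weights(n, edges):
    # Literal transcription of (5): [W]_ij = -min{1/d_i, 1/d_j} on edges,
    # W_ii = -sum_{(i,k) in E} W_ik, which again enforces zero row sums.
    d = degrees(n, edges)
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = -min(1.0 / d[i], 1.0 / d[j])
    np.fill_diagonal(W, -W.sum(axis=1))
    return W

# Example: a 4-node ring, with the two delta rules quoted above (for U_i = 1).
n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]
Lap = constant_weights(n, edges, -1.0)          # delta = -1 recovers the Laplacian L
eig = np.sort(np.linalg.eigvalsh(Lap))
W_md = constant_weights(n, edges, 1.0 / degrees(n, edges).max())   # max-degree rule
W_bc = constant_weights(n, edges, 2.0 / (eig[1] + eig[-1]))        # best constant rule
W_mh = metropolis_hastings_weights(n, edges)
assert np.allclose(W_mh @ np.ones(n), 0.0)
```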
B. Multi-step Gradient Methods

The classical gradient method for minimizing a convex function $f$ takes the form

$x_{k+1} = x_k - \alpha \nabla f(x_k)$   (6)

for some step size $\alpha > 0$. This method is a suitable choice for distributed optimization due to its simplicity and ease of implementation. However, gradient-based algorithms often exhibit slow convergence [9]. The convergence rate can be improved by exploiting the history of iterates obtained in the preceding iterations. Methods in which the new iterate depends on several preceding ones are called multi-step methods. In this paper, we use two-step iterations of the form

$x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1})$   (7)

where $\alpha > 0$, $\beta \ge 0$ are fixed step sizes. This method, proposed for centralized applications by Polyak [9], is called the Heavy Ball (HB) method and can be tuned to have a smoother trajectory toward the local minimum than the traditional gradient iteration [9]. The smoother trajectory translates into faster convergence. The reason we focus on this method is that it is computationally simple and does not need higher-order information, which might not be locally available in distributed applications.

III. ACCELERATED RESOURCE ALLOCATION

Our first contribution in this paper is to apply multi-step methods to the distributed resource allocation problem (1). To this end, consider the iteration

$x_{k+1} = x_k - \alpha W \nabla f(x_k) + \beta (x_k - x_{k-1})$   (8)

Let $x^\star$ be an optimal point of $f(x)$. By Taylor expansion,

$\nabla f(x_k) = \nabla^2 f(x^\star)(x_k - x^\star) + o(\|x_k - x^\star\|)$

Letting $z_{k+1} = (x_{k+1} - x^\star,\; x_k - x^\star)$, we can rewrite (8) as

$z_{k+1} = A z_k + o(\|z_k\|)$   (9)

where the $2n \times 2n$ matrix $A$ is given by

$A = \begin{bmatrix} (1+\beta) I - \alpha W H & -\beta I \\ I & 0 \end{bmatrix}, \quad H = \nabla^2 f(x^\star)$   (10)

Let $L = \lambda_1(H) \le \lambda_2(H) \le \dots \le \lambda_n(H) = U$ be the eigenvalues of $H$. In what follows we study the local convergence of the iterative process (8).

Theorem 1: Let $\omega = WH$, let $\lambda_n(\omega)$ be the largest eigenvalue of $\omega$, and let $x^\star$ be a nonsingular minimum point of $f(x)$, $x \in \mathbb{R}^n$. Then for

$0 \le \beta < 1, \quad 0 < \alpha < \frac{2(1+\beta)}{\lambda_n(\omega)}, \quad L I \le \nabla^2 f(x^\star) \le U I$   (11)

the method (8) converges to $x^\star$ at the rate of a geometric progression:

$\frac{\|x_{k+1} - x^\star\|}{\|x_k - x^\star\|} \le q_1, \quad 0 \le q_1 < 1.$

Proof: For brevity we omit the proofs; the interested reader may refer to [15].

The next theorem gives the optimal step sizes as well as the optimal convergence rate of (8).

Theorem 2: The convergence rate of (8) is given by

$q_1 = \max\left\{ 2\sqrt{\beta},\; |1+\beta-\alpha\lambda_2(\omega)|,\; |1+\beta-\alpha\lambda_n(\omega)| \right\} - \sqrt{\beta}$   (12)

The minimal value of $q_1$,

$q_1^\star = \frac{\sqrt{\lambda_n(\omega)} - \sqrt{\lambda_2(\omega)}}{\sqrt{\lambda_n(\omega)} + \sqrt{\lambda_2(\omega)}}$   (13)

is obtained for the step sizes $\alpha = \alpha^\star$, $\beta = \beta^\star$ with

$\alpha^\star = \frac{4}{\big(\sqrt{\lambda_n(\omega)} + \sqrt{\lambda_2(\omega)}\big)^2}, \quad \beta^\star = \left(\frac{\sqrt{\lambda_n(\omega)} - \sqrt{\lambda_2(\omega)}}{\sqrt{\lambda_n(\omega)} + \sqrt{\lambda_2(\omega)}}\right)^2$   (14)

It is known that the convergence rate of the non-accelerated method is (cf. [9], [8], [13])

$q_2 = \frac{\lambda_n(\omega) - \lambda_2(\omega)}{\lambda_n(\omega) + \lambda_2(\omega)}$   (15)
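Theorem 2 turns the two extreme eigenvalues of $\omega$ into step sizes and a rate. The following minimal sketch (our own illustration, not the paper's code) evaluates (13)-(15) for given $\lambda_2(\omega)$, $\lambda_n(\omega)$:

```python
import numpy as np

def heavy_ball_tuning(lam2, lamn):
    # Optimal step sizes (14) and the rates (13), (15), given
    # lambda_2(omega) and lambda_n(omega).
    s2, sn = np.sqrt(lam2), np.sqrt(lamn)
    alpha = 4.0 / (sn + s2) ** 2
    beta = ((sn - s2) / (sn + s2)) ** 2
    q1 = (sn - s2) / (sn + s2)           # accelerated rate (13)
    q2 = (lamn - lam2) / (lamn + lam2)   # non-accelerated rate (15)
    return alpha, beta, q1, q2

# Example with kappa = lambda_n/lambda_2 = 100:
alpha, beta, q1, q2 = heavy_ball_tuning(1.0, 100.0)
print(q1, q2)   # q1 = 9/11 = 0.818... versus q2 = 99/101 = 0.980...
```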
One can verify that $q_1 \le q_2$, and that the improvement in convergence rate depends on the quantity $\kappa = \lambda_n(\omega)/\lambda_2(\omega)$. In particular, when $\kappa$ is large, the speed-up is roughly proportional to $\sqrt{\kappa}$ (cf. the analysis of the centralized heavy-ball method in [9]). Large values of $\kappa$ arise essentially for two reasons: one is a large spread in the values $U_i$ between nodes, and the other is the topology of the underlying graph. Assume for simplicity that $U_i = 1$ for all $i$, so that $\omega = W$, and consider Laplacian weights. It is well known from spectral graph theory [14] that the complete graph on $n$ vertices has $\lambda_2(\mathcal{L}) = \lambda_n(\mathcal{L}) = n/(n-1)$, so $\kappa = 1$ and there is no real advantage in using the accelerated scheme. On the other hand, for a ring network of $n$ nodes the eigenvalues of the Laplacian are $1 - \cos(2\pi k/n)$ for $k = 0, \dots, n-1$, which means that $\kappa$ grows quickly with $n$, and the performance improvements of the accelerated methods can be substantial.

1) Numerical Examples: In this section we numerically compare the performance of the accelerated and non-accelerated resource allocation methods under the various weight matrices described in the previous section. To make a fair comparison, we consider a random graph generated in a similar way as in [13]. The network used in Fig. 1 consists of 20 nodes, each of degree three. Edges are bidirectional, and the objective function at each node has the form

$f_i(x_i) = \tfrac{1}{2} a_i (x_i - c_i)^2 + \log\big[1 + \exp(b_i(x_i - d_i))\big], \quad i = 1, \dots, n.$

The coefficients $a_i$, $b_i$, $c_i$, and $d_i$ are drawn uniformly from the intervals $[0,2]$, $[-2,2]$, $[-10,10]$ and $[-10,10]$, respectively, and are kept the same for all simulations. As initial values, we fix the sum of the variables to zero (i.e., $\sum_{i=1}^n x_i = 0$). The second derivatives of the functions $f_i$ are positive, with lower and upper bounds given by

$l_i = a_i, \quad u_i = a_i + \tfrac{1}{4} b_i^2, \quad i = 1, \dots, n.$

Using the techniques in [13], we set the parameters $\delta = -0.1251$ for the maximum-degree weights and $\delta^\star = -0.2030$ for the best constant weight scheme. Fig. 1 shows the objective value minus the optimal value as a function of the iteration count. The plot shows that the accelerated gradient method yields a significantly increased convergence rate.

[Fig. 1. Convergence behavior of HB and the Xiao-Boyd method on a randomly generated network, for all the heuristic weights (Xiao-Boyd and HB, each with Metropolis, max-degree, and best constant weights). The plot shows the objective values $f(x(t)) - f^\star$ versus iteration number $t$.]
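A self-contained sketch of an experiment in this spirit is given below. It is our own illustration, not the paper's code: the 3-regular graph, random seed, and the simplified tuning from conservative bounds $\lambda_2(W)\min_i l_i$ and $\lambda_n(W)\max_i u_i$ on the spectrum of $\omega = WH$ are all stand-ins, and we fold the sign convention of (4) into a positive-semidefinite scaling $W = \mathcal{L}/\max_i d_i$:

```python
import numpy as np

rng = np.random.default_rng(1)   # hypothetical data, not the paper's instance
n = 20
# Stand-in 3-regular graph: a ring plus "diameter" chords (every node has degree 3).
edges = [(i, (i + 1) % n) for i in range(n)] + \
        [(i, i + n // 2) for i in range(n // 2)]

a = rng.uniform(0.0, 2.0, n)
b = rng.uniform(-2.0, 2.0, n)
c = rng.uniform(-10.0, 10.0, n)
d = rng.uniform(-10.0, 10.0, n)

def grad_f(x):
    # Gradient of f_i(x_i) = 0.5 a_i (x_i - c_i)^2 + log(1 + exp(b_i (x_i - d_i))).
    return a * (x - c) + b / (1.0 + np.exp(-b * (x - d)))

# Maximum-degree constant weights, written here as W = L / max_i d_i.
L = np.zeros((n, n))
for i, j in edges:
    L[i, i] += 1.0; L[j, j] += 1.0; L[i, j] -= 1.0; L[j, i] -= 1.0
W = L / np.max(np.diag(L))

# Conservative bounds on the spectrum of omega = W H via l_i = a_i, u_i = a_i + b_i^2/4.
lam = np.sort(np.linalg.eigvalsh(W))
lo = lam[1] * a.min()
hi = lam[-1] * (a + 0.25 * b ** 2).max()
alpha = 4.0 / (np.sqrt(hi) + np.sqrt(lo)) ** 2
beta = ((np.sqrt(hi) - np.sqrt(lo)) / (np.sqrt(hi) + np.sqrt(lo))) ** 2

x = np.zeros(n)                  # sum_i x_i = 0, and it stays 0 since 1^T W = 0
x_prev = x.copy()
for _ in range(300):             # accelerated iteration (8)
    x, x_prev = x - alpha * W @ grad_f(x) + beta * (x - x_prev), x

# At the optimum all marginal costs f_i'(x_i) agree; their spread should be ~0,
# while the budget sum remains exactly preserved.
print(np.ptp(grad_f(x)), x.sum())
```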
IV. ROBUSTNESS ANALYSIS

When the upper and lower bounds on the Hessian are known and their ratio is significant, multi-step methods give a considerable increase in convergence rate over the standard gradient iteration. However, such upper and lower bounds are sometimes hard to estimate accurately in practice. It is therefore important to analyze the sensitivity of multi-step methods to errors in the Hessian bounds, to assess whether the performance benefits prevail when the bounds are inaccurate. Such a robustness analysis is performed next.

Let $L$ and $U$ be the true lower and upper bounds on the Hessian of the objective function, and let $\hat{L}$ and $\hat{U}$ be the estimated bounds used when tuning the gradient and accelerated gradient methods. We would like to determine for which values of $\hat{L}$ and $\hat{U}$ the two methods converge under their "optimal" step-size rules, and to compare the associated convergence rates. In [9], sufficient conditions for the convergence of gradient iterates of smooth functions are given. According to [9, Theorem 3], for a fixed step size $0 < \alpha < 2/U$, the gradient algorithm converges with rate

$q_2 = \max\{ |1 - \alpha L|,\; |1 - \alpha U| \} < 1.$

The minimum value $q_2^\star = (U-L)/(U+L)$ is attained by the optimal step size $\alpha^\star = 2/(L+U)$. Together with the analysis in Theorem 1, this yields the following observation:

Proposition 1: Let $0 < \hat{L} < \hat{U}$ be erroneous estimates of the Hessian bounds $L$, $U$, respectively. For all values of $\hat{L}$, $\hat{U}$ fulfilling the condition $U < \hat{U} + \hat{L}$, both the gradient iteration (6) with step size $\hat{\alpha} = 2/(\hat{L}+\hat{U})$ and the HB algorithm (7) with parameters $\hat{\alpha} = 4/(\sqrt{\hat{L}} + \sqrt{\hat{U}})^2$, $\hat{\beta} = \big((\sqrt{\hat{U}} - \sqrt{\hat{L}})/(\sqrt{\hat{U}} + \sqrt{\hat{L}})\big)^2$ converge to the optimum of $f(x)$.

According to this proposition, the convergence conditions of the two methods are rather similar. To compare convergence rates, we first present the following lemma.

Lemma 1: For parameters $\hat{L}$, $\hat{U}$ satisfying the convergence condition of Proposition 1, the gradient method with step size $\hat{\alpha} = 2/(\hat{L}+\hat{U})$ converges with rate

$\hat{q}_2 = \max\{ |1 - \hat{\alpha}L|,\; |1 - \hat{\alpha}U| \}$   (16)

and $\hat{q}_2 > q_2^\star$ in either of the two cases $\hat{L}+\hat{U} < L+U$ and $\hat{L}+\hat{U} > L+U$; i.e., the gradient method with misestimated Hessian bounds has a slower convergence rate than the optimally tuned one. Perhaps unexpectedly, the best step-size choice occurs exactly when $\hat{L}+\hat{U} = L+U$, for which $\hat{q}_2 = q_2^\star$.

On the other hand, Theorem 2 establishes the convergence rate of HB with arbitrary (uncertain) step sizes to be

$\hat{q}_1 = \max\left\{ \sqrt{\hat{\beta}},\; |1+\hat{\beta}-\hat{\alpha}L| - \sqrt{\hat{\beta}},\; |1+\hat{\beta}-\hat{\alpha}U| - \sqrt{\hat{\beta}} \right\}$   (17)

where $\hat{\alpha}$ and $\hat{\beta}$ are the values of the optimal step-size rule evaluated at the erroneous Hessian bounds. It is interesting to note that for $\hat{L} = \hat{U}$ under the convergence condition $\hat{L}+\hat{U} > U$, both methods converge with rate $1 - L/\hat{L}$. However, when the Hessian bounds are not known exactly, the Heavy Ball method can perform either better or worse than the gradient method. We have the following results.

Proposition 2: Assume $\hat{L} > L$ and $\hat{U} > U$. Then the convergence rate of the Heavy Ball method is $\hat{q}_1 = 1 + \hat{\beta} - \hat{\alpha}L - \hat{\beta}^{1/2}$, which is faster than the gradient alternative.

Proposition 3: Assume $\hat{L} < L$, $\hat{U} > U$, and $\hat{L}+\hat{U} = L+U$. Then the convergence rate of the Heavy Ball method is given by $\hat{q}_1 = \hat{\beta}^{1/2}$. Moreover, if $(\hat{U}/\hat{L})^{1/2} > U/L$, then this rate is slower than the gradient alternative.

In our simulation experience, the situation described in Proposition 3 is rather singular; in fact, we have not yet encountered it in non-contrived scenarios.
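The trade-offs described in Lemma 1 and Propositions 2-3 are easy to explore numerically. The following small sketch (our own code) evaluates $\hat{q}_2$ from (16) and $\hat{q}_1$ from (17) for given true and estimated bounds:

```python
import numpy as np

def rates_with_estimated_bounds(L, U, Lh, Uh):
    # Gradient rate (16) under step size alpha_hat = 2/(Lh + Uh).
    a2 = 2.0 / (Lh + Uh)
    q2 = max(abs(1.0 - a2 * L), abs(1.0 - a2 * U))
    # Heavy-ball rate (17) under the optimal rule evaluated at (Lh, Uh).
    a1 = 4.0 / (np.sqrt(Lh) + np.sqrt(Uh)) ** 2
    b1 = ((np.sqrt(Uh) - np.sqrt(Lh)) / (np.sqrt(Uh) + np.sqrt(Lh))) ** 2
    q1 = max(np.sqrt(b1),
             abs(1.0 + b1 - a1 * L) - np.sqrt(b1),
             abs(1.0 + b1 - a1 * U) - np.sqrt(b1))
    return q1, q2

L, U = 1.0, 10.0
print(rates_with_estimated_bounds(L, U, 1.0, 10.0))  # exact bounds: q1 < q2
print(rates_with_estimated_bounds(L, U, 2.0, 12.0))  # Lh > L, Uh > U: HB faster (Prop. 2)
print(rates_with_estimated_bounds(L, U, 0.1, 10.9))  # Lh + Uh = L + U and
                                                     # sqrt(Uh/Lh) > U/L: HB slower (Prop. 3)
```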
V. FURTHER APPLICATIONS

A. Accelerated Consensus

Distributed algorithms for consensus seeking, first proposed by Tsitsiklis et al. [5], have been heavily researched during the last few years. We consider consensus via linear iterations of the form

$x_i(k+1) = W_{ii} x_i(k) + \sum_{j \in \mathcal{N}_i} W_{ij} x_j(k), \quad i = 1, \dots, n,$   (18)

where $W_{ij}$ is the weight that node $i$ assigns to $x_j$. In [16], necessary and sufficient conditions on $W$ for (18) to converge to the average of the initial values are given. More specifically, it is shown that such matrices $W$ have their largest eigenvalue equal to $1$, while the second largest eigenvalue is strictly less than $1$ and determines the asymptotic convergence factor towards consensus [16]. For symmetric weights, [16] derived a semi-definite program (SDP) for finding the weight matrix $W$ that maximizes the convergence rate, and proposed several simple heuristics for finding suboptimal weights, including constant and Metropolis-Hastings weights (cf. [16], [2]). In what follows, we use dual decomposition to develop multi-step consensus iterations with accelerated convergence.

1) Consensus Algorithm Using Dual Decomposition: Iterations similar to the primal consensus iterations can be obtained via networked minimization of quadratic functions:

$\min \; \sum_{i=1}^n \tfrac{1}{2}(x_i - c_i)^2 \quad \text{subject to} \quad x_i = x_j, \; \forall (i,j) \in \mathcal{E}$   (19)

Any distributed method for solving this problem is also a distributed averaging (or average consensus) algorithm. A primal-dual algorithm combined with a subgradient method for solving (19) is presented in [17], and an alternating direction method of multipliers that casts the optimization problem in a distributed fashion is discussed in [18]. The iterations in the latter are shown to be resilient to communication noise.

One can rewrite (19) in vector notation and apply Lagrange duality to the coupling constraint to find the Lagrangian

$L(x,\mu) = \tfrac{1}{2}(x-c)^T (x-c) + \mu^T A x$   (20)

By the first-order optimality conditions for (20), we can define the dual problem as the following unconstrained minimization:

$\text{minimize} \;\; -g(\mu) = \tfrac{1}{2} \mu^T A A^T \mu - \mu^T A c$   (21)

For a given Lagrange multiplier $\mu$, the primal variable $x$ is updated by minimizing the Lagrangian (20). The accelerated gradient method hence suggests the multi-step iteration

$\mu_{k+1} = \mu_k - \alpha (A A^T \mu_k - A c) + \beta (\mu_k - \mu_{k-1}),$
$x_{k+1} = c - A^T \mu_{k+1}$   (22)

Note that the $\mu$-iteration in the dual scheme has the same form as the distributed resource allocation iteration under Laplacian weight selection. Hence, our analysis and design rules for selecting optimal algorithm parameters apply immediately. To understand the relationship between this method and alternative schemes in the literature, it is useful to eliminate the $\mu$-update and consider the dynamics of the primal variables only. To this end, multiplying both sides of (22) by $A^T$ yields

$A^T \mu_{k+1} = A^T \mu_k - \alpha A^T (A A^T \mu_k - A c) + \beta A^T (\mu_k - \mu_{k-1})$   (23)

Using $A^T \mu_k = c - x_k$ and letting $W = A^T A$ gives

$x_{k+1} = \big((1+\beta) I - \alpha W\big) x_k - \beta x_{k-1}$   (24)

As argued previously, $W$ is positive semidefinite with a simple eigenvalue at $0$ and fulfills $W\mathbf{1} = 0$, $\mathbf{1}^T W = 0$. With this definition, the consensus iteration coincides with the general results of Theorem 2 (the Hessian in (10) is the identity); a code sketch of the iteration (24) is given at the end of this section.

We compare this simple technique with two alternative acceleration methods: one from the literature on accelerated consensus, and one from the literature on (centralized) first-order techniques for convex optimization.

2) Accelerated Consensus via Shift Registers: Shift registers can be used to speed up convergence in the stochastic form of (18). In [6] it is demonstrated by simulations that the consen…
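As a concrete companion to the dual-decomposition scheme of Section V-A, the following minimal sketch (our own illustrative code, assuming an undirected connected graph given as an edge list) implements the primal-form iteration (24) with the optimal parameters of Theorem 2:

```python
import numpy as np

def accelerated_consensus(c, edges, iters=100):
    # Primal-form multi-step consensus (24):
    #   x_{k+1} = ((1+beta) I - alpha W) x_k - beta x_{k-1},
    # with W = A^T A the graph Laplacian and x_0 = x_{-1} = c
    # (i.e., mu_0 = mu_{-1} = 0 in the dual iteration (22)).
    n = len(c)
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, i] += 1.0; W[j, j] += 1.0
        W[i, j] -= 1.0; W[j, i] -= 1.0
    lam = np.sort(np.linalg.eigvalsh(W))
    s2, sn = np.sqrt(lam[1]), np.sqrt(lam[-1])   # sqrt of lambda_2(W), lambda_n(W)
    alpha = 4.0 / (sn + s2) ** 2                 # optimal step sizes (14);
    beta = ((sn - s2) / (sn + s2)) ** 2          # the Hessian in (10) is I here
    x = np.asarray(c, dtype=float).copy()
    x_prev = x.copy()
    for _ in range(iters):
        x, x_prev = (1.0 + beta) * x - alpha * (W @ x) - beta * x_prev, x
    return x

# Example on a 10-node ring: the iterates approach the average of c.
c = np.arange(10.0)
ring = [(i, (i + 1) % 10) for i in range(10)]
print(accelerated_consensus(c, ring), c.mean())
```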