Computer Science and Artificial Intelligence Laboratory Technical Report
MIT-CSAIL-TR-2007-025   CBCL-268   May 1, 2007
Massachusetts Institute of Technology, Cambridge, MA 02139 USA, www.csail.mit.edu

Notes on Regularized Least Squares

Ryan M. Rifkin (MIT Center for Biological and Computational Learning, rif@mit.edu)
Ross A. Lippert (D. E. Shaw Research, ross.lippert@deshaw.com)

Abstract

This is a collection of information about regularized least squares (RLS). The facts here are not "new results", but we have not seen them usefully collected together before. A key goal of this work is to demonstrate that with RLS, we get certain things "for free": if we can solve a single supervised RLS problem, we can search for a good regularization parameter $\lambda$ at essentially no additional cost.

The discussion in this paper applies to "dense" regularized least squares, where we work with matrix factorizations of the data or kernel matrix. It is also possible to work with iterative methods such as conjugate gradient, and this is frequently the method of choice for large data sets in high dimensions with very few nonzero dimensions per point, such as text classification tasks. The results discussed here do not apply to iterative methods, which have different design tradeoffs.

We present the results in greater detail than strictly necessary, erring on the side of showing our work. We hope that this will be useful to people trying to learn more about linear algebra manipulations in the machine learning context.

1 Notation and Basics

We are given data points $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. We define $S^i$ to be the data set with the $i$th point removed: $S^i = \{(X_1, Y_1), \ldots, (X_{i-1}, Y_{i-1}), (X_{i+1}, Y_{i+1}), \ldots, (X_n, Y_n)\}$. We let $X$ simultaneously refer to the set $\{X_1, \ldots, X_n\}$ and to the $n \times d$ matrix whose $i$th row is $X_i^t$.

We assume a positive semidefinite kernel function $k$, which generalizes the notion of dot product in a Reproducing Kernel Hilbert Space (RKHS) [1, 6]. Commonly used kernels include:

linear: $k(X_i, X_j) = X_i^t X_j$
polynomial: $k(X_i, X_j) = (X_i^t X_j + 1)^d$
gaussian: $k(X_i, X_j) = \exp\left(-\frac{\|X_i - X_j\|^2}{\sigma^2}\right)$

The polynomial order $d$ or gaussian bandwidth $\sigma$ must be specified by the user. We define the kernel matrix $K$ to satisfy $K_{ij} = k(X_i, X_j)$. Abusing notation slightly, we allow the kernel function $k$ to take multiple data points and produce a matrix of results: $k(X, X) = K$, and, given an arbitrary point $X_*$, $k(X, X_*)$ is a column vector whose $i$th entry is $k(X_i, X_*)$. We note that the linear kernel (also called the dot-product kernel) has special computational properties, which we discuss in Section 5.

Given a square matrix $M$, $\mathrm{diagm}(M)$ denotes the diagonal matrix satisfying $\mathrm{diagm}(M)_{ii} = M_{ii}$, and $\mathrm{diagv}(M)$ the column vector satisfying $\mathrm{diagv}(M)_i = M_{ii}$. We abuse notation somewhat by assuming that fractional division is done elementwise, so a vector divided by a vector yields another vector. We use the standard convention that $I$ represents an identity matrix of the appropriate size (we occasionally index the size of the matrix for clarity, e.g. $I_n$), and $e_i$ represents a column vector with a 1 in the $i$th position and zeros elsewhere.
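As a concrete illustration of the notation above, here is a minimal NumPy sketch (ours, not part of the original report) of forming the kernel matrix $K = k(X, X)$ and a test column $k(X, X_*)$ for the three kernels listed; the function names and the synthetic data are our own.

```python
import numpy as np

def linear_kernel(X, Z):
    """K_ij = X_i^t Z_j for rows X_i of X and Z_j of Z."""
    return X @ Z.T

def polynomial_kernel(X, Z, degree=2):
    """K_ij = (X_i^t Z_j + 1)^d."""
    return (X @ Z.T + 1.0) ** degree

def gaussian_kernel(X, Z, sigma=1.0):
    """K_ij = exp(-||X_i - Z_j||^2 / sigma^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / sigma**2)

# Synthetic example: an n-by-d data matrix X whose ith row is X_i^t.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))      # n = 100, d = 5
K = gaussian_kernel(X, X, sigma=2.0)   # the n-by-n kernel matrix k(X, X)
k_star = gaussian_kernel(X, X[:1])     # column k(X, X_*) for an arbitrary X_*
```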
2 RLS Basics

RLS arises as a Tikhonov minimization problem [2] with a square loss:

\[
\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} (f(X_i) - Y_i)^2 + \frac{\lambda}{2} \|f\|_K^2. \tag{1}
\]

The representer theorem [7, 5] guarantees that the solution to (1) can be written as

\[
f = \sum_{i=1}^{n} c_i k(X_i, \cdot),
\]

for some $c \in \mathbb{R}^n$. Using basic properties of RKHS [1, 6], we can therefore rewrite (1) as

\[
\min_{c \in \mathbb{R}^n} \frac{1}{2} \|Y - Kc\|_2^2 + \frac{\lambda}{2} c^t K c.
\]

Setting the derivative with respect to $c$ to zero, we see that $c$ must satisfy

\[
(K + \lambda I) c = Y.
\]

We note that $c$ exists and is unique: $K$ is positive semidefinite, so $K + \lambda I$ is positive definite (for $\lambda > 0$). We define $G(\lambda) = K + \lambda I$. Frequently, $\lambda$ will be clear from context, and we just write $G$. The predictions at the training points will be given by

\[
f(X) = Kc = K(K + \lambda I)^{-1} Y = K G^{-1} Y,
\]

and the prediction at a new test point $X_*$ is:

\[
f(X_*) = \sum_i c_i k(X_i, X_*) = k(X, X_*)^t c = Y^t G^{-1} k(X, X_*).
\]

Note that we are not suggesting forming $G^{-1}$ explicitly and manipulating it; we discuss computation below.
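The following NumPy sketch (our own illustration, not the authors' code) solves $(K + \lambda I)c = Y$ with a direct solver rather than forming $G^{-1}$ explicitly, then evaluates the training predictions $Kc$ and a test prediction $k(X, X_*)^t c$. The synthetic data and the linear-kernel choice are assumptions made only for the example.

```python
import numpy as np

def rls_fit(K, Y, lam):
    """Solve (K + lam*I)c = Y for the expansion coefficients c,
    using a direct solver rather than an explicit inverse."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Y)

# Synthetic example with a linear kernel, K = X X^t.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = rng.standard_normal(50)
K = X @ X.T

c = rls_fit(K, Y, lam=1e-1)
train_preds = K @ c              # f(X) = Kc
x_star = rng.standard_normal(3)  # an arbitrary new point X_*
k_star = X @ x_star              # k(X, X_*) for the linear kernel
test_pred = k_star @ c           # f(X_*) = k(X, X_*)^t c
```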
3 Leave-one-out computations

In general, we need some mechanism for finding a "good" value of the regularization parameter $\lambda$. We are usually interested in finding a function that does well on future examples. Since the distribution over examples is unknown, this is not that easy. One method is to use a "holdout" set: we try a bunch of values of $\lambda$, and for each value, we compute the performance on the holdout (also called the development) set, and choose the $\lambda$ that does best on it. If we have enormous quantities of data, a holdout set is a good option.

If we have relatively few data points, we might like to use leave-one-out values instead. The $i$th leave-one-out value is $f_{S^i}(X_i)$, the value of the RLS function trained on $S^i$ applied at $X_i$. The $i$th leave-one-out error is $Y_i - f_{S^i}(X_i)$. We define LOOV and LOOE to be the vectors of leave-one-out values and errors over the training set. $\|\mathrm{LOOE}\|_2^2$ is considered a good empirical proxy for the error on future points, and we often want to choose parameters by minimizing this quantity. (It is possible to overfit by doing this, but in practice it is often quite effective. Note that combining minimizing $\|\mathrm{LOOE}\|$ with feature selection on the entire dataset can overfit disastrously; in general, if we are minimizing $\|\mathrm{LOOE}\|$, we must be able to "form" $f_{S^i}$ without looking at $X_i$ at all.)

In this section, we develop tools for working with LOOE and LOOV for RLS. We will see that for RLS, working with LOOE and LOOV is computationally effective. Note that this material supersedes Section 3.3 of [4], which contains several typos.

We consider two different approaches (really two different views of the same approach) for finding leave-one-out values. The first approach can be seen in many references, such as [4] or [3]. The second approach, which demonstrates that leave-one-out RLS computations can be treated as augmented linear systems, is (to our knowledge) an original development by the authors.

3.1 The Direct Approach

Imagine we knew $f_{S^i}$. Define the vector $Y^i$ via

\[
Y^i_j = \begin{cases} Y_j & j \neq i \\ f_{S^i}(X_i) & j = i \end{cases}
\]

If we solve (1) but replace $Y$ with $Y^i$, the optimal solution will be $f_{S^i}$. This is easy to see: if we solve a smaller problem by discarding the $i$th example entirely, we will get $f_{S^i}$, and adding the example $(X_i, f_{S^i}(X_i))$ adds nothing to the cost. More formally, for any $f \in \mathcal{H}$,

\[
\begin{aligned}
\frac{1}{2} \sum_{j=1}^{n} (Y^i_j - f(X_j))^2 + \frac{\lambda}{2} \|f\|_K^2
&\geq \frac{1}{2} \sum_{j \neq i} (Y^i_j - f(X_j))^2 + \frac{\lambda}{2} \|f\|_K^2 \\
&\geq \frac{1}{2} \sum_{j \neq i} (Y^i_j - f_{S^i}(X_j))^2 + \frac{\lambda}{2} \|f_{S^i}\|_K^2 \\
&= \frac{1}{2} \sum_{j=1}^{n} (Y^i_j - f_{S^i}(X_j))^2 + \frac{\lambda}{2} \|f_{S^i}\|_K^2.
\end{aligned}
\]

We note in passing that this result is much more general than what we've just written: we did not use the fact that we are optimizing an RKHS norm, or the specific form of the square loss; we used only that the square loss allows us to "construct" a $Y$ value with loss zero.

The above derivation makes it clear that for RLS, we can define the expansion coefficients for $f_{S^i}$ via $c^i = G^{-1} Y^i$, and therefore $f_{S^i}(X_i) = (K G^{-1} Y^i)_i$.

Note that this is all theoretical so far, because in order to form $Y^i$, we need to know $f_{S^i}(X_i)$. However, assuming we have solved RLS, and therefore computed $f_S(X) = K G^{-1} Y$, we can derive a nice expression for $f_{S^i}(X_i)$:

\[
f_{S^i}(X_i) - f_S(X_i) = \sum_j (K G^{-1})_{ij} (Y^i_j - Y_j) = (K G^{-1})_{ii} (f_{S^i}(X_i) - Y_i),
\]

which leads immediately to

\[
f_{S^i}(X_i) = \frac{f_S(X_i) - (K G^{-1})_{ii} Y_i}{1 - (K G^{-1})_{ii}}
= \frac{(K G^{-1} Y)_i - (K G^{-1})_{ii} Y_i}{1 - (K G^{-1})_{ii}}.
\]

Therefore,

\[
\mathrm{LOOV} = \frac{K G^{-1} Y - \mathrm{diagm}(K G^{-1}) Y}{\mathrm{diagv}(I - K G^{-1})},
\]

\[
\begin{aligned}
\mathrm{LOOE} &= Y - \mathrm{LOOV} \\
&= Y + \frac{\mathrm{diagm}(K G^{-1}) Y - K G^{-1} Y}{\mathrm{diagv}(I - K G^{-1})} \\
&= \frac{\mathrm{diagm}(I - K G^{-1}) Y}{\mathrm{diagv}(I - K G^{-1})} + \frac{\mathrm{diagm}(K G^{-1}) Y - K G^{-1} Y}{\mathrm{diagv}(I - K G^{-1})} \\
&= \frac{Y - K G^{-1} Y}{\mathrm{diagv}(I - K G^{-1})}. \qquad (2)
\end{aligned}
\]

We can simplify our expressions in a way that leads to better computational and numerical properties by noting that, in terms of the eigendecomposition $K = Q \Lambda Q^t$ used in Section 4,

\[
K G^{-1} = Q \Lambda Q^t Q (\Lambda + \lambda I)^{-1} Q^t
= Q \Lambda (\Lambda + \lambda I)^{-1} Q^t
= Q (\Lambda + \lambda I - \lambda I)(\Lambda + \lambda I)^{-1} Q^t
= I - \lambda G^{-1}.
\]

Substituting into our expression for LOOE yields

\[
\mathrm{LOOE} = \frac{Y - K G^{-1} Y}{\mathrm{diagv}(I - K G^{-1})}
= \frac{Y - (I - \lambda G^{-1}) Y}{\mathrm{diagv}(I - (I - \lambda G^{-1}))}
= \frac{\lambda G^{-1} Y}{\mathrm{diagv}(\lambda G^{-1})}
= \frac{G^{-1} Y}{\mathrm{diagv}(G^{-1})}
= \frac{c}{\mathrm{diagv}(G^{-1})}.
\]

3.2 The Augmented Linear System Approach

Suppose we consider the augmented optimization problem

\[
\min_{c \in \mathbb{R}^n, \beta \in \mathbb{R}} \frac{1}{2} \|Y - Kc - \beta e_i\|_2^2 + \frac{\lambda}{2} c^t K c.
\]

While it is not immediately obvious, we will now proceed to show that solving this $n+1$ variable optimization problem yields the $i$th LOO error. Taking the derivative of the augmented problem with respect to $c$ and $\beta$ shows that the solution is given by the solution of the augmented linear system

\[
\begin{bmatrix} K + \lambda I & e_i \\ e_i^t K & 1 \end{bmatrix}
\begin{bmatrix} c \\ \beta \end{bmatrix}
=
\begin{bmatrix} Y \\ e_i^t Y \end{bmatrix}.
\]

We begin with an intuitive argument. From inspection of either the original augmented problem or the linear system, it is obvious that $\beta$ will take the value $Y_i - (Kc)_i = Y_i - e_i^t K c$. Using this fact, the top row of our augmented linear system can be rewritten as

\[
\begin{aligned}
(K + \lambda I) c + e_i (Y_i - e_i^t K c) &= Y \\
\left( (I - e_i e_i^t) K + \lambda I \right) c &= Y - e_i Y_i.
\end{aligned}
\]

Consider the $i$th equation in the above system. The $i$th row of $(I - e_i e_i^t) K$ is zero, and the $i$th element on the right hand side is also zero. Therefore $\lambda c_i = 0$, and we conclude that $c_i = 0$. It is now clear that the remaining $c$'s must be the RLS solution over the other $n-1$ points, and that $\beta$ is a "readout" of the $i$th RLS error.

We can also easily derive the actual value of the LOO error. We take the top equation in the linear system and multiply it (on the left) by $-e_i^t K (K + \lambda I)^{-1}$, yielding

\[
- e_i^t K c - e_i^t K (K + \lambda I)^{-1} e_i \beta = - e_i^t K (K + \lambda I)^{-1} Y.
\]

We add this to the bottom equation, obtaining

\[
\begin{aligned}
\left( 1 - e_i^t K (K + \lambda I)^{-1} e_i \right) \beta &= Y_i - e_i^t K (K + \lambda I)^{-1} Y \\
\beta &= \frac{Y_i - e_i^t K (K + \lambda I)^{-1} Y}{1 - e_i^t K (K + \lambda I)^{-1} e_i},
\end{aligned}
\]

which is the result we derived in Equation (2).

4 Searching for λ is free

For a single value of $\lambda$, we can solve RLS for $c$ by solving the linear system $(K + \lambda I)c = Y$ from Section 2. In Matlab, we would do this with the 'slash' operator, rather than explicitly forming the inverse. However, usually we do not know a good value of $\lambda$ in advance. We will therefore make use of the eigendecomposition $K = Q \Lambda Q^t$, where $\Lambda$ is diagonal with $\Lambda_{ii} \geq 0$ and $Q Q^t = I$. We now see that

\[
G = K + \lambda I = Q \Lambda Q^t + \lambda I = Q (\Lambda + \lambda I) Q^t,
\]

which of course implies $G^{-1} = Q (\Lambda + \lambda I)^{-1} Q^t$.

It takes $O(n^3)$ time to compute the eigendecomposition. (In practice, to machine precision; in theory, finding an exact eigendecomposition is as hard as exactly finding the roots of an arbitrary polynomial.) Asymptotically, this is no slower than solving a single linear system, although the constant is worse, by maybe a factor of 4. Once we have this eigendecomposition, we can find $c$ for a given $\lambda$ in $O(n^2)$ time by solving

\[
c(\lambda) = Q (\Lambda + \lambda I)^{-1} Q^t Y,
\]

noting that $(\Lambda + \lambda I)$ is diagonal, so multiplying the vector $Q^t Y$ by its inverse is a linear time operation.

Once we have $c$, we can also get the predictions $Kc$ in $O(n^2)$ time. Furthermore, we can compute a single entry of $G(\lambda)^{-1}$ in $O(n)$ time:

\[
G^{-1}_{ij} = \left( Q (\Lambda + \lambda I)^{-1} Q^t \right)_{ij} = \sum_{k=1}^{n} \frac{Q_{ik} Q_{jk}}{\Lambda_{kk} + \lambda},
\]

and therefore we can compute $\mathrm{diagv}(G^{-1})$, and compute LOOE, in $O(n^2)$ time.

In conclusion, it takes $O(n^3)$ time to solve a single RLS problem for $c$. If we compute an eigendecomposition of $K$, we can compute $c(\lambda)$ and $\|\mathrm{LOOE}(\lambda)\|$ over a quite fine grid of $\lambda$, at basically no additional cost.
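A minimal NumPy sketch of the "free" λ search just described (our own illustration, in Python rather than the Matlab mentioned above, and not the authors' code): one eigendecomposition of $K$, then $c(\lambda)$, $\mathrm{diagv}(G(\lambda)^{-1})$, and $\|\mathrm{LOOE}(\lambda)\|$ for each $\lambda$ on a grid, at $O(n^2)$ cost per value.

```python
import numpy as np

def rls_loocv_path(K, Y, lambdas):
    """For each lambda, return (c, ||LOOE||_2) from one eigendecomposition of K.

    Uses c = Q (Lam + lam I)^{-1} Q^t Y and LOOE = c / diagv(G^{-1}),
    where G^{-1}_ii = sum_k Q_ik^2 / (Lam_kk + lam)."""
    evals, Q = np.linalg.eigh(K)          # K = Q diag(evals) Q^t, O(n^3) once
    QtY = Q.T @ Y                         # computed once, O(n^2)
    results = []
    for lam in lambdas:
        inv_diag = 1.0 / (evals + lam)    # diagonal of (Lam + lam I)^{-1}, O(n)
        c = Q @ (inv_diag * QtY)          # c(lambda), O(n^2)
        G_inv_diag = (Q**2) @ inv_diag    # diagv(G^{-1}), O(n^2)
        looe = c / G_inv_diag             # vector of leave-one-out errors
        results.append((c, np.linalg.norm(looe)))
    return results

# Usage, assuming K and Y exist as in the earlier sketches:
# lambdas = np.logspace(-6, 2, 50)
# path = rls_loocv_path(K, Y, lambdas)
# best_lam = lambdas[int(np.argmin([err for _, err in path]))]
```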
5 The Linear Case

We now discuss the special case of a linear kernel $k(X_i, X_j) = X_i^t X_j$. We assume throughout this section that we are in $\mathbb{R}^d$, with $d < n$. (If $d > n$, we are better off ignoring the linearity and working with the $n \times n$ kernel matrix.) It is easy to see that with a linear kernel, the function we are learning is linear as well:

\[
f(x) = c^t k(X, x) = c^t X x = w^t x,
\]

where we define the hyperplane $w$ to be $X^t c$. We can classify new points in $O(d)$ time, using $w$, rather than having to compute a weighted sum of $n$ kernel products (which will usually cost $O(nd)$ time).

In the nonlinear case, there is basically only one technique available: to work with the eigendecomposition of $K$. In the linear case, we have two options. The first is to work with the singular value decomposition of $X$ and to compute leave-one-out values. The second is to work with the eigendecomposition of $X^t X$ and to use a holdout set to validate $\lambda$.

5.1 The SVD Approach

Suppose that we only have a moderate amount of data ($n$ is not too large), and we want to compute leave-one-out values to validate $\lambda$. Then we can work with the SVD of $X$. The economy-size SVD of $X$ can be written as $X = U S V^t$, with $U \in \mathbb{R}^{n \times d}$, $S \in \mathbb{R}^{d \times d}$, $V \in \mathbb{R}^{d \times d}$, $U^t U = V^t V = V V^t = I_d$, and $S$ diagonal and positive definite. (Note that $U U^t \neq I_n$.)

In Section 3 we derived an expression for the leave-one-out error in terms of $c$ and $G^{-1}$. In the linear case, we can use the SVD to express these quantities directly in terms of $X$, rather than $K$:

\[
\begin{aligned}
K &= X X^t = (U S V^t)(V S U^t) = U S^2 U^t \\
K + \lambda I &= U S^2 U^t + \lambda I_n \\
&= \begin{bmatrix} U & U_\perp \end{bmatrix}
   \begin{bmatrix} S^2 + \lambda I_d & \\ & \lambda I_{n-d} \end{bmatrix}
   \begin{bmatrix} U^t \\ U_\perp^t \end{bmatrix} \\
&= U (S^2 + \lambda I_d) U^t + \lambda U_\perp U_\perp^t \\
&= U (S^2 + \lambda I_d) U^t + \lambda (I_n - U U^t) \\
(K + \lambda I)^{-1} &= (U S^2 U^t + \lambda I_n)^{-1} \\
&= \left( \begin{bmatrix} U & U_\perp \end{bmatrix}
   \begin{bmatrix} S^2 + \lambda I_d & \\ & \lambda I_{n-d} \end{bmatrix}
   \begin{bmatrix} U^t \\ U_\perp^t \end{bmatrix} \right)^{-1} \\
&= \begin{bmatrix} U & U_\perp \end{bmatrix}
   \begin{bmatrix} S^2 + \lambda I_d & \\ & \lambda I_{n-d} \end{bmatrix}^{-1}
   \begin{bmatrix} U^t \\ U_\perp^t \end{bmatrix} \\
&= U (S^2 + \lambda I)^{-1} U^t + \lambda^{-1} U_\perp U_\perp^t \\
&= U (S^2 + \lambda I)^{-1} U^t + \lambda^{-1} (I - U U^t) \\
&= U \left[ (S^2 + \lambda I)^{-1} - \lambda^{-1} I \right] U^t + \lambda^{-1} I \\
c &= (K + \lambda I)^{-1} Y = U \left[ (S^2 + \lambda I)^{-1} - \lambda^{-1} I \right] U^t Y + \lambda^{-1} Y \\
G^{-1}_{ij} &= \sum_{k=1}^{d} U_{ik} U_{jk} \left[ (S^2_{kk} + \lambda)^{-1} - \lambda^{-1} \right] + [i = j]\, \lambda^{-1} \\
G^{-1}_{ii} &= \sum_{k=1}^{d} U_{ik}^2 \left[ (S^2_{kk} + \lambda)^{-1} - \lambda^{-1} \right] + \lambda^{-1} \\
\mathrm{LOOE} &= \frac{c}{\mathrm{diagv}(G^{-1})}
= \frac{U \left[ (S^2 + \lambda I)^{-1} - \lambda^{-1} I \right] U^t Y + \lambda^{-1} Y}
       {\mathrm{diagv}\!\left( U \left[ (S^2 + \lambda I)^{-1} - \lambda^{-1} I \right] U^t + \lambda^{-1} I \right)}
\end{aligned}
\]

(Here $[i = j]$ is 1 when $i = j$ and 0 otherwise.)
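A NumPy sketch of the SVD-based computation just derived (again our own illustration, not the authors' code): one economy SVD of $X$, then $c(\lambda)$, $\mathrm{diagv}(G(\lambda)^{-1})$, and $\mathrm{LOOE}(\lambda)$ for each $\lambda$, without ever forming the $n \times n$ kernel matrix.

```python
import numpy as np

def linear_rls_loocv_path(X, Y, lambdas):
    """Leave-one-out errors and hyperplanes for linear RLS over a grid of
    lambdas, using the economy SVD X = U S V^t (d < n assumed)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)   # U: n x d, s: singular values
    UtY = U.T @ Y                                     # computed once
    out = []
    for lam in lambdas:
        d_vec = 1.0 / (s**2 + lam) - 1.0 / lam        # diag of (S^2 + lam I)^{-1} - lam^{-1} I
        c = U @ (d_vec * UtY) + Y / lam               # c = U[.]U^t Y + lam^{-1} Y
        G_inv_diag = (U**2) @ d_vec + 1.0 / lam       # diagv(G^{-1})
        looe = c / G_inv_diag                         # LOOE = c / diagv(G^{-1})
        w = X.T @ c                                   # hyperplane w = X^t c
        out.append((w, looe))
    return out

# Usage, assuming an n-by-d data matrix X (d < n) and labels Y:
# lambdas = np.logspace(-6, 2, 50)
# results = linear_rls_loocv_path(X, Y, lambdas)
```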
5.2 The SVD Approach, Direct Derivation

We can obtain the same results via an alternate derivation, which we include for pedagogical purposes. Suppose we phrase the RLS problem directly in terms of $w$, rather than $c$:

\[
\min_{w \in \mathbb{R}^d} L(w) = \frac{1}{2} \|Y - Xw\|_2^2 + \frac{\lambda}{2} \|w\|_2^2.
\]

Taking the derivative with respect to $w$,

\[
\frac{\partial L}{\partial w} = X^t X w - X^t Y + \lambda w,
\]

and setting to zero implies

\[
w = (X^t X + \lambda I)^{-1} X^t Y
= V (S^2 + \lambda I)^{-1} V^t X^t Y
\;\left(= V (S^2 + \lambda I)^{-1} V^t (V S U^t) Y \right)
\;\left(= V S (S^2 + \lambda I)^{-1} U^t Y \right).
\]

Define $w_{\setminus i}$ to be the hyperplane obtained when we remove the $i$th training point and train on the remaining $n-1$ points (the linear analog of $f_{S^i}$). Defining $Y^i$ as before,

\[
\begin{aligned}
\min_{w \in \mathbb{R}^d} \frac{1}{2} \sum_{j} (Y^i_j - w^t x_j)^2 + \frac{\lambda}{2} \|w\|_2^2
&\geq \min_{w \in \mathbb{R}^d} \frac{1}{2} \sum_{j \neq i} (Y^i_j - w^t x_j)^2 + \frac{\lambda}{2} \|w\|_2^2 \\
&= \frac{1}{2} \sum_{j \neq i} (Y^i_j - w_{\setminus i}^t x_j)^2 + \frac{\lambda}{2} \|w_{\setminus i}\|_2^2 \\
&= \frac{1}{2} \sum_{j} (Y^i_j - w_{\setminus i}^t x_j)^2 + \frac{\lambda}{2} \|w_{\setminus i}\|_2^2,
\end{aligned}
\]

so $w_{\setminus i}$ "solves" RLS with $Y$ replaced with $Y^i$, and $w_{\setminus i} = (X^t X + \lambda I)^{-1} X^t Y^i$. Precisely analogously to the nonlinear case,

\[
(w_{\setminus i} - w)^t x_i
= \left( (X^t X + \lambda I)^{-1} X^t (Y^i - Y) \right)^t x_i
= (Y^i - Y)^t X (X^t X + \lambda I)^{-1} x_i
= (w_{\setminus i}^t x_i - Y_i)\, x_i^t (X^t X + \lambda I)^{-1} x_i,
\]

which implies

\[
w_{\setminus i}^t x_i = \frac{w^t x_i - Y_i\, x_i^t (X^t X + \lambda I)^{-1} x_i}{1 - x_i^t (X^t X + \lambda I)^{-1} x_i}.
\]

Therefore,

\[
\begin{aligned}
\mathrm{LOOE} &= Y - \frac{Xw - \mathrm{diagm}\!\left(X (X^t X + \lambda I)^{-1} X^t\right) Y}{\mathrm{diagv}\!\left(I - X (X^t X + \lambda I)^{-1} X^t\right)} \\
&= Y - \frac{X (X^t X + \lambda I)^{-1} X^t Y - \mathrm{diagm}\!\left(X (X^t X + \lambda I)^{-1} X^t\right) Y}{\mathrm{diagv}\!\left(I - X (X^t X + \lambda I)^{-1} X^t\right)} \\
&= \frac{Y - X (X^t X + \lambda I)^{-1} X^t Y}{\mathrm{diagv}\!\left(I - X (X^t X + \lambda I)^{-1} X^t\right)}.
\end{aligned}
\]

Noting that

\[
\begin{aligned}
X (X^t X + \lambda I)^{-1} X^t
&= U S V^t (V S^2 V^t + \lambda I)^{-1} V S U^t
= U S^2 (S^2 + \lambda I)^{-1} U^t \\
&= U \left[ S^2 + \lambda I - \lambda I \right] (S^2 + \lambda I)^{-1} U^t
= U U^t - \lambda U (S^2 + \lambda I)^{-1} U^t \\
I - X (X^t X + \lambda I)^{-1} X^t
&= I - \lambda U (\lambda^{-1} I) U^t + \lambda U (S^2 + \lambda I)^{-1} U^t \\
&= \lambda \left( \lambda^{-1} I + U \left[ (S^2 + \lambda I)^{-1} - \lambda^{-1} I \right] U^t \right),
\end{aligned}
\]

we can derive the same expression for LOOE as we did in the previous section.

5.3 The covariance matrix approach

If we require only $w$ and not the LOO values, we need only the $V$ and $S$ pieces of the SVD. These can be computed from the eigendecomposition of the (nonstandardized) covariance matrix $X^t X$, which is only $d \times d$. Furthermore, computation of this matrix is fairly easily parallelizable. For very large datasets in moderate numbers of dimensions, this is probably the method of choice.

6 Conclusions

The interplay of the RLS algorithm with simple matrix factorization techniques yields extremely efficient algorithms. For both linear and nonlinear RLS, in the same (asymptotic) time it takes to solve a single problem, we can find a good $\lambda$ using LOO cross-validation. For linear RLS, we have an additional option of using a validation set, which can be even faster in practice.

The primary disadvantage of nonlinear RLS is the need to work with the entire kernel matrix, which is $n \times n$. There have been many attempts to approximate the entire function using an expansion over a subset of the data points. We leave the extension of the methods contained here to this case for future work.

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[2] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
[3] P. J. Green and B. W. Silverman. Nonparametric Regression and Generalized Linear Models. Number 58 in Monographs on Statistics and Applied Probability. Chapman & Hall, 1994.
[4] Ryan M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches to Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.
[5] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In 14th Annual Conference on Computational Learning Theory, pages 416–426, 2001.
[6] Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, 2002.
[7] Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial & Applied Mathematics, 1990.