Chapter 5
Least Squares
Abstract In this chapter we will cover the basics of the celebrated method of Least
Squares (LS). The approach of this method is different from the stochastic gradient
approach of the previous chapter. As always, the idea will be to obtain an estimate
of a given system using measured input-output pairs (and no statistical information),
assuming a model in which the input and output pairs are linearly related. We will
also present the Recursive Least Squares (RLS) algorithm, which is a recursive
and more computationally efficient implementation of the LS method. One of its
advantages is that it can be used in real time, as the input-output pairs are received.
In this sense, it is very similar to the adaptive filters obtained in the previous
chapter. Several important properties of LS and RLS will be discussed.
5.1 The Least Squares (LS) Problem
Assume that we have the following quantities:
{d(0), d(1), . . . , d(n − 1)},   {x(0), x(1), . . . , x(n − 1)},   (5.1)
where x(i) ∈ R^L, ∀i = 0, 1, . . . , n − 1. The values d(i) and the vectors x(i)¹ can be
taken as the output and input of a given physical system. As discussed in Sect.
2.3, even though the actual relationship between the input and output pairs might not
be linear, we can always write:
d = C w_T + v,   (5.2)
1 Notice that at this moment of the presentation we are not assuming that the vectors x(i) are
generated from a scalar signal x(i) through a tapped-delay line structure.
where:
d = [d(0) d(1) . . . d(n − 1)]^T   and   C = [x(0) x(1) . . . x(n − 1)]^T,   (5.3)
that is, the i-th row of C is x^T(i).
Notice that d ∈ R^n and C ∈ R^{n×L}. As in the linear regression model (2.11), the
vector w_T represents the true system when it exists and is linear, or a first linear
approximation to the general input-output relation of the given data pairs. The vector
v could represent modeling errors, additive noise perturbations, etc. The question is
how to obtain w_T based on the information given by d and C. As we do not have
any statistical information about the perturbation v, it seems natural to search for ŵ
such that:
ŵ = arg min_{w∈R^L} ‖d − Cw‖²,   (5.4)
where ‖ · ‖ denotes the usual canonical Euclidean norm [1]. The criterion in (5.4) is
known as the Least Squares problem and is satisfactory for two reasons. Firstly, as
we are assuming that there is no statistical information in our problem, it makes sense
to adopt a purely deterministic criterion of optimization, using only the input-output
pairs given. The idea is to find a vector wˆ that explains the data given in the best
manner possible using familiar Euclidean distances. Secondly, observe that in the
special case that the input-output pairs come from jointly ergodic processes [2], we
can write:
(1/n) ‖d − Cw‖² = (1/n) Σ_{i=0}^{n−1} |d(i) − x^T(i)w|² −→ E[|d(i) − x^T(i)w|²]  as n → ∞.   (5.5)
So, if n is sufficiently large and the input-output pairs come from ergodic processes,
the solution to problem (5.4)² coincides with the Wiener solution from Chap. 2.
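As a quick numerical illustration of (5.5), the following sketch (with dimensions, noise level and variable names of our own choosing, assuming i.i.d. Gaussian inputs and noise) compares the normalized LS cost with the corresponding mean square error for a fixed candidate w:

```python
import numpy as np

# Illustration of (5.5): for ergodic data the normalized LS cost
# approaches the mean square error E[|d(i) - x^T(i) w|^2].
rng = np.random.default_rng(0)
L, n = 4, 100_000
w_true = rng.standard_normal(L)

X = rng.standard_normal((n, L))          # rows are x^T(i)
v = 0.1 * rng.standard_normal(n)         # zero-mean i.i.d. noise
d = X @ w_true + v

w = rng.standard_normal(L)               # any fixed candidate w
empirical = np.mean((d - X @ w) ** 2)    # (1/n) ||d - Cw||^2
# For this model the true MSE is ||w_true - w||^2 + sigma_v^2.
theoretical = np.sum((w_true - w) ** 2) + 0.1 ** 2
print(empirical, theoretical)            # the two values nearly coincide
```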
5.1.1 Solution to the LS Problem
In order to solve the problem in (5.4) we write
‖d − Cw‖² = ‖d‖² − 2w^T C^T d + w^T C^T Cw.   (5.6)
Differentiating (5.6) with respect to w and setting the result equal to zero, we obtain:
C^T Cw = C^T d.   (5.7)
2 The reader should notice that the factor 1/n does not modify the solution in (5.4).
Equation (5.7) is known as the normal equation. It can be shown that the normal
equation always has a solution.³ Depending on the dimensions of C, (5.7) can have
one or infinitely many solutions:
• n ≥ L: As C has more rows than columns, the number of equations in (5.2)
is greater than the number of unknowns in w. The system in (5.2) is said to be
overdetermined. In this case there will be one or an infinite number of
solutions to (5.7), depending on the column rank of C. If C has full column rank
the solution is unique. If it is column rank deficient there exists an infinite number
of solutions.
• n < L: In this case C has more columns than rows. That is, in (5.2) there are more
unknowns than equations. The system in (5.2) is said to be underdetermined.
In this case, the number of solutions to (5.7) is always infinite, because
C is necessarily column rank deficient.
The most common situation, at least in adaptive filtering,⁴ will be the first one, in
which n ≥ L. Suppose then that C has full column rank. It can be easily shown that
C^T C ∈ R^{L×L} has full rank and is therefore invertible. The solution to (5.7)
is then unique and:
ŵ = (C^T C)^{−1} C^T d.   (5.8)
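As a minimal sketch (with hypothetical data of our own making), the solution (5.8) can be computed with NumPy; in practice one prefers np.linalg.lstsq, which solves (5.4) through an orthogonal factorization instead of forming C^T C explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 200, 5
C = rng.standard_normal((n, L))          # full column rank w.p. 1
w_true = rng.standard_normal(L)
d = C @ w_true + 0.05 * rng.standard_normal(n)

# Normal-equation solution (5.8); solve() avoids an explicit inverse.
w_hat = np.linalg.solve(C.T @ C, C.T @ d)

# Numerically preferred equivalent: SVD-based least squares.
w_lstsq, *_ = np.linalg.lstsq(C, d, rcond=None)
print(np.allclose(w_hat, w_lstsq))       # True
```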
When C does not have full column rank, the inverse of C^T C does not exist. In this case,
the number of solutions to (5.7) is infinite. As C^T C is singular, any
solution to (5.7) can be written as:
ŵ = ŵ_p + ŵ_h,   (5.9)
where ŵ_p is any particular solution to (5.7) and ŵ_h is any solution to:
C^T Cŵ_h = 0,   (5.10)
that is, ŵ_h is any vector in N(C^T C),⁵ the null space of C^T C, which can easily
be seen to be equal to N(C). It is also clear that the difference between two solutions
ŵ₁ and ŵ₂ is a vector belonging to N(C).
A natural question when (5.7) has multiple solutions is which one would be the
best choice. There is no simple answer to that problem. A common choice is
the following:
ŵ = arg min_{w∈R^L} ‖w‖²   subject to   C^T Cw = C^T d.   (5.11)
3 Notice that R(C^T C) = R(C^T) [3], where R(A) denotes the range or column space of the
matrix A. This implies that there exists at least one w which satisfies Eq. (5.7).
4 In recent years there has been growing interest in underdetermined systems with sparsity conditions
on the solution. This is commonly known as compressed sensing or compressive sampling. The
interested reader can see [4] and the references therein.
5 As C^T C is singular, N(C^T C) contains other vectors besides the null vector.
This means that we would choose the solution of (5.7) with minimum canonical
Euclidean norm. Then, given a particular solution ŵ_p, we need to search over the
vectors ŵ_h ∈ N(C) that lead to ŵ = ŵ_p + ŵ_h having minimum norm. Although
this seems difficult, the fact is that the solution to problem (5.11) is very simple
and is given by
ŵ = C†d,   (5.12)
where C† denotes the Moore-Penrose pseudoinverse of C. The pseudoinverse of a
general rectangular matrix C is a generalization of the concept of inverse. Its calculation
in the most general case can be obtained from the singular value decomposition
(SVD) of the matrix C [5]. The SVD is a very powerful technique to decompose a
general rectangular matrix.
5.1.1.1 Singular Value Decomposition and Pseudoinverse
Let C ∈ R^{n×L} be an arbitrary rectangular matrix. Matrix C can be written as:
• If L ≤ n:
C = U [Σ; 0] V^T,   Σ ∈ R^{L×L},   (5.13)
where [A; B] denotes the vertical stacking of the blocks A and B.
• If L ≥ n:
C = U [Σ 0] V^T,   Σ ∈ R^{n×n},   (5.14)
with U ∈ R^{n×n} and V ∈ R^{L×L} being orthonormal matrices⁶ whose columns are
the eigenvectors of CC^T and C^T C respectively. The square matrix Σ is diagonal with
nonnegative entries, that is:
Σ = diag(σ_1, σ_2, . . . , σ_K, 0, . . . , 0),   (5.15)
where σ_i², i = 1, . . . , K, with K = rank(C) ≤ min(n, L), are the non-null
eigenvalues of either CC^T or C^T C.
The SVD is a very important tool in linear algebra, matrix analysis and signal
processing. The reader interested in more details on the SVD can see [5, 6] and the
references therein. The pseudoinverse C† can be written in terms of the SVD of C as [7]:
• If L ≤ n:
C† = V [Σ^{−1} 0] U^T.   (5.16)
• If L ≥ n:
C† = V [Σ^{−1}; 0] U^T.   (5.17)
6 U^T U = UU^T = I_n and V^T V = VV^T = I_L.
When C has full column rank, C†C = I_L, and when it has full row rank, CC† = I_n.
When L ≤ n and C has full column rank, it is easy to show that:
C† = (C^T C)^{−1} C^T,   (5.18)
which, when replaced in (5.12), leads to the unique solution (5.8).
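The following sketch (assuming a full column rank C of our own making) computes C† from the SVD as in (5.16), and checks it against both (5.18) and NumPy's np.linalg.pinv; the product C†d then gives the minimum-norm solution (5.12):

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 200, 5
C = rng.standard_normal((n, L))
d = rng.standard_normal(n)

# Pseudoinverse from the SVD, as in (5.16): invert only the
# non-null singular values (here C has full column rank).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Agrees with (5.18) when C has full column rank ...
assert np.allclose(C_pinv, np.linalg.inv(C.T @ C) @ C.T)
# ... and with NumPy's built-in pseudoinverse.
assert np.allclose(C_pinv, np.linalg.pinv(C))

w_min_norm = C_pinv @ d                  # minimum-norm solution (5.12)
```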
5.2 Orthogonal Projection Interpretation
The theory of LS is a very elegant one. Behind the LS problem is the concept of
orthogonal projection as in the Wiener filtering case (see Sect. 2.3). In the following,
we will show that the solution to the LS problem can be interpreted as a Euclidean
orthogonal projection onto an appropriate subspace.
Let us call S(C) the subspace generated by the columns of C. From (5.4) it is
clear that the quantity d̂ ≜ Cw, for any vector w ∈ R^L (even the optimal one ŵ),
can be written as:
d̂ = Σ_{i=1}^{L} w_i u(i),   (5.19)
where u(i), i = 1, . . . , L, are the columns of C. As the right-hand side of (5.19) is
a linear combination of the columns of C, we can conclude that
d̂ ∈ S(C).   (5.20)
This means that for every choice of w ∈ R^L, the value of d̂ necessarily belongs
to S(C). In fact, problem (5.4) is equivalent to searching for the vector d̂ ∈ S(C) which
is closest to d in a Euclidean distance sense. The vector ŵ can be interpreted as the
coordinates of the optimal d̂ in S(C).
The problem of approximating a given vector with another vector in a given
subspace is a very well known problem in mathematics [8]. For a complete inner product
space (that is, a Hilbert space), it is known that the solution to that problem is an
orthogonal projection. In precise terms, and for our problem where the space is R^n
with the usual canonical Euclidean inner product, this means:
u^T(i)e = 0,   i = 1, . . . , L,   (5.21)
where e = d − Cŵ is the error between the original d and the optimal approximation
Cŵ ∈ S(C). Equation (5.21) means that the approximation error for the optimal
solution has to be orthogonal to the columns of matrix C, which is equivalent to
being orthogonal to the entire subspace S(C). Let us define d̂_opt ≜ Cŵ. In Fig. 5.1
we depict the geometrical properties of the orthogonal projection onto S(C).
[Fig. 5.1 Orthogonal projection onto S(C)]
It is easy to check that the optimal d̂_opt ∈ S(C) is unique. This means that it is
the unique vector that satisfies (5.21), which can be put more compactly as:
C^T e = 0.   (5.22)
It is straightforward to see that the previous expression can be written as:
C^T Cŵ = C^T d,   (5.23)
which is the normal equation in (5.7). Assuming that C has full column rank, the
solution for the optimal ŵ is given by (5.8) and the value of d̂_opt is given by:
d̂_opt = C (C^T C)^{−1} C^T d.   (5.24)
The n × n matrix:
P[S(C)] = C (C^T C)^{−1} C^T   (5.25)
is called the projection matrix, and gives a representation for the operator that computes
the unique orthogonal projection of any vector in R^n onto S(C). Projection matrices
have important properties [3]:
• P[S(C)] = P^T[S(C)].
• P²[S(C)] = P[S(C)].
• P[S⊥(C)] = I_n − P[S(C)].
The second property says that a projection matrix is an idempotent operator. That is,
applying the operator twice or more times is equivalent to applying it just once. This
is equivalent to saying that if d ∈ S (C), then P [S (C)] d = d, which gives e = 0.
The third property says that the projector onto the orthogonal complement of S(C),
denoted as S⊥(C), can be calculated in a very simple manner using the projector
onto S(C). In fact, it is easy to show that for every d ∈ R^n, we can write:
d = P[S(C)] d + P[S⊥(C)] d.   (5.26)
It follows immediately that:
e = (I_n − P[S(C)]) d.   (5.27)
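The properties above are easy to verify numerically; the following sketch (with an arbitrary full column rank C of our own choosing) checks the symmetry and idempotence of (5.25) together with (5.22), (5.26) and (5.27):

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 8, 3
C = rng.standard_normal((n, L))
d = rng.standard_normal(n)

P = C @ np.linalg.inv(C.T @ C) @ C.T     # projector (5.25)

assert np.allclose(P, P.T)               # symmetry
assert np.allclose(P @ P, P)             # idempotence
e = (np.eye(n) - P) @ d                  # error vector (5.27)
assert np.allclose(C.T @ e, 0)           # orthogonality (5.22)
assert np.allclose(P @ d + e, d)         # decomposition (5.26)
```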
As a final remark we would like to point out that the uniqueness of d̂_opt does not
mean that ŵ is unique. In fact, an infinite number of vectors ŵ giving the same d̂_opt are
possible. The vector ŵ will be unique just when there is only one way of writing
d̂_opt in terms of u(i), i = 1, . . . , L. That will be the case when the u(i) are linearly
independent, which is equivalent to the full column rank condition on C. This is
consistent with (5.8).
5.3 Weighted Least Squares (WLS)
In the previous section we showed that the LS problem formulated in (5.4) is basically
an orthogonal projection problem in the Euclidean space R^n with the canonical inner
product. Many inner products can be defined on R^n, and a natural question is
how the LS solution generalizes when we use a non-canonical inner product.
A general inner product on R^n can be written as:
⟨a, b⟩_M = a^T Mb,   (5.28)
where M ∈ R^{n×n} is a symmetric positive definite matrix. The canonical inner product
corresponds to M = I_n. From the inner product we can also define a norm:
‖a‖²_M = a^T Ma.   (5.29)
The use of a generalized norm allows us to weight the importance of the entries of the vector
d and the matrix C. In this manner, and for a given symmetric positive definite matrix
M, the WLS problem can be written as:
ŵ = arg min_{w∈R^L} ‖d − Cw‖²_M.   (5.30)
As in the LS problem, this is an orthogonal projection problem with the inner product
given by (5.28). Extending (5.22), it holds that:
C^T Me = 0,   (5.31)
where e = d − Cŵ is the error vector at the optimal ŵ. In a straightforward manner
we can obtain:
C^T MCŵ = C^T Md,   (5.32)
which is the normal equation for this problem. At this point, the same considerations
made for the LS problem apply here. To keep things simple we will
assume that M^{1/2}C is a full column rank matrix.⁷ In that case C^T MC is invertible
and the optimal ŵ can be obtained as:
ŵ = (C^T MC)^{−1} C^T Md.   (5.33)
The unique optimal d̂_opt can be written as:
d̂_opt = C (C^T MC)^{−1} C^T Md,   (5.34)
which means that the projection matrix P_M[S(C)] can be put as:
P_M[S(C)] = C (C^T MC)^{−1} C^T M.   (5.35)
Except for the symmetry condition, P_M[S(C)] has the properties mentioned previously
for P[S(C)].
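As a brief sketch (the diagonal weighting M below is an arbitrary choice of ours), the WLS solution (5.33) and the projector (5.35) can be computed as follows; note that P_M is idempotent but, in general, no longer symmetric:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 100, 4
C = rng.standard_normal((n, L))
d = C @ rng.standard_normal(L) + rng.standard_normal(n)

# Example weighting: emphasize later samples (M diagonal, pos. def.).
M = np.diag(np.linspace(0.1, 1.0, n))

# WLS solution (5.33) via the weighted normal equations (5.32).
w_wls = np.linalg.solve(C.T @ M @ C, C.T @ M @ d)

# Projector (5.35): idempotent, but not symmetric in general.
P_M = C @ np.linalg.solve(C.T @ M @ C, C.T @ M)
assert np.allclose(P_M @ P_M, P_M)
```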
5.4 Some Statistical Properties of the LS Solution
The appealing nature of the LS formulation (5.4) is reinforced by some interesting
and useful properties. In this section we will mention some of them without proof.⁸
The interested reader can see [7] and [9] for a more detailed analysis. Consider (5.2),
rewritten below for ease of reference:
d = C w_T + v.   (5.36)
Let us consider v = [v(0) v(1) . . . v(n − 1)]^T, where v(i) is a sequence of zero-mean
i.i.d. random variables with E[v²(i)] = σ_v². Although the matrix C could be
composed of observations of a random process, it will be assumed fixed and perfectly
known. That is, the only randomness in the model is due to v(i). The LS solution ŵ
can be thought of as an estimate of w_T. The following holds:
1. The LS solution ŵ is an unbiased estimator. That is:
7 Notice that as M is a symmetric positive definite matrix, its square root is well defined [3].
8 We will restrict ourselves to the regular LS. Similar properties for the more general WLS also
exist.
E[ŵ] = w_T,   (5.37)
where the expectation is with respect to the distribution of the noise v.
2. The LS solution is the best linear unbiased estimator (BLUE) of w_T when v(i)
is a sequence of zero-mean i.i.d. random variables. That is:
K_LS ≤ K,   (5.38)
where, assuming that C is full column rank,
K_LS = E[(ŵ − w_T)(ŵ − w_T)^T] = σ_v² (C^T C)^{−1}   (5.39)
is the covariance matrix of the LS estimator. The matrix K is the covariance matrix
of any other linear unbiased estimator (i.e., w̌ = Ad, with A ∈ R^{L×n} and
AC = I_L). Equation (5.38) means that the covariance matrix of the LS estimator
is less positive definite⁹ than the covariance matrix of any other linear unbiased
estimator of w_T. This implies:
E[‖ŵ − w_T‖²] = tr(K_LS) ≤ tr(K) = E[‖w̌ − w_T‖²],   (5.40)
which means that the LS estimator has a lower dispersion or variance than any
other linear unbiased estimator. This is a very important property, because it
means that when we restrict ourselves to simple linear estimators of w_T (and
according to a data model as in (5.36)), the LS solution is the one with the lowest
variance.
3. When the noise sequence v(i) is zero-mean i.i.d. and Gaussian distributed, the
LS estimator ŵ is not only the best linear unbiased estimator, but the best of all
unbiased estimators, even nonlinear ones. In more precise terms, the covariance
matrix K_LS attains the Cramér-Rao lower bound¹⁰ on the covariance matrix of
any unbiased estimator of w_T. In terms of mean square error, there is no better
unbiased estimator than the LS solution. The reader should be cautious, as this result
depends heavily on the Gaussian assumption on v(i). If the noise is not Gaussian,
this result might not be true. However, property 2 is always true.
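A simple Monte Carlo experiment (dimensions and noise level chosen arbitrarily) illustrates properties 1 and 2: averaging the LS estimates over many noise realizations recovers w_T, and their empirical covariance approaches (5.39):

```python
import numpy as np

rng = np.random.default_rng(5)
n, L, sigma_v, trials = 50, 3, 0.5, 20_000
C = rng.standard_normal((n, L))          # fixed and known, as assumed
w_T = rng.standard_normal(L)

A = np.linalg.inv(C.T @ C) @ C.T         # LS estimator matrix
W = np.empty((trials, L))
for t in range(trials):
    v = sigma_v * rng.standard_normal(n) # zero-mean i.i.d. Gaussian noise
    W[t] = A @ (C @ w_T + v)             # LS estimate for this realization

print(W.mean(axis=0) - w_T)              # ~0: unbiasedness (5.37)
K_emp = np.cov(W.T)                      # empirical covariance
K_LS = sigma_v**2 * np.linalg.inv(C.T @ C)   # theoretical value (5.39)
print(np.max(np.abs(K_emp - K_LS)))      # small sampling error
```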
9 Similarly to Eq. (4.93), we consider the usual partial ordering on the symmetric positive definite
matrices [3].
10 The Cramér-Rao lower bound is a general lower bound on the variance of any unbiased estimator
of a given parameter. It depends on the pdf of the random variables that influence a given observation
linked to the parameter to be estimated. As it is an absolute lower bound, obtaining an estimator that
is able to attain it is very relevant. When such an estimator exists, it is said to be an efficient estimator.
In the majority of cases (especially when the random variables involved are not Gaussian), it is
not possible to obtain an efficient estimator. For more details on the Cramér-Rao lower bound the
interested reader can see [9] and [10].
5.5 Recursive Least Squares (RLS)
Consider Eq. (5.8), which will be rewritten as:
w(n − 1) = (C^T_{n−1} C_{n−1})^{−1} C^T_{n−1} d_{n−1}.   (5.41)
In (5.41) we have explicitly denoted that the quantities in (5.3) are generated using the n
input-output pairs available up to time n − 1.¹¹ In practice it is interesting to deal with
the situation in which n increases. In some situations, it would be desirable to obtain
an LS solution for a given number of input-output pairs without waiting to have
them all. Each time a new pair is available, a new LS solution taking into account
the new input-output pair would be desirable. This would be the case when
input-output pairs are obtained in real time, arriving in a sequential manner. The
calculation of the new LS solution would imply the implementation of (5.41) with
C_n and d_n. The main problem with this is the need to obtain the inverse of C^T_n C_n. As
this matrix is L × L, the computational cost of obtaining its inverse is O(L³). If
the dimension L is large, the cost of inverting an L × L matrix for every new
input-output pair received can be overwhelming. On
the other hand, we see that the vector d_n and the matrix C_n can be written as:
d_n = [d_{n−1}; d(n)],   C_n = [C_{n−1}; x^T(n)].   (5.42)
These relations show that there is a significant overlap of information between C_n
and C_{n−1}. The natural question is: can this structure in C_n and d_n be exploited in
some smart way in order to reduce the aforementioned computational requirement?
The answer is affirmative, as we will show next.
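The key to the affirmative answer is the rank-one structure implied by (5.42): C^T_n C_n = C^T_{n−1} C_{n−1} + x(n)x^T(n) and C^T_n d_n = C^T_{n−1} d_{n−1} + x(n)d(n). The following sketch (with arbitrary data of our own) verifies these identities numerically; the matrix inversion lemma then turns them into an O(L²) update of the solution:

```python
import numpy as np

rng = np.random.default_rng(7)
n, L = 10, 4
C_prev = rng.standard_normal((n, L))     # C_{n-1}
d_prev = rng.standard_normal(n)          # d_{n-1}
x_new, d_new = rng.standard_normal(L), rng.standard_normal()

# Stack the new input-output pair as in (5.42).
C = np.vstack([C_prev, x_new])
d = np.append(d_prev, d_new)

# Rank-one update structure implied by (5.42):
assert np.allclose(C.T @ C, C_prev.T @ C_prev + np.outer(x_new, x_new))
assert np.allclose(C.T @ d, C_prev.T @ d_prev + x_new * d_new)
```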
First, we consider the following weighted cost function for the LS problem¹²:
w(n) = arg min_{w∈R^L} ‖d_n − C_n w‖²_{Λ_n},   (5.43)
where Λ_n is an (n + 1) × (n + 1) diagonal matrix given by:
Λ_n = diag(λⁿ, λ^{n−1}, . . . , λ, 1),   0 < λ ≤ 1.   (5.44)
Defining the error signal e(i) = d(i) − x^T(i)w, i = 0, . . . , n, we can write:
11 Notice that, in order to simplify the notation a little, we use w instead of the ŵ that was used
in the previous sections to denote the LS solution.
12 In the following, and without loss of generality, we will concentrate on the RLS algorithm known
as Exponentially Weighted RLS (EWRLS), which generalizes the standard RLS algorithm (with
λ = 1) and is by far the most used and popular version. Another important variant is the sliding
window RLS algorithm [11].
‖d_n − C_n w‖²_{Λ_n} = Σ_{i=0}^{n} λ^{n−i} |e(i)|².   (5.45)
Notice that the cost function in (5.45) presents a forgetting factor given by λ. As time
progresses, the above cost forgets the distant past and gives more importance to the
present and immediate past. This weighting in time is what allows the resulting
algorithm to track variations in the underlying system.
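To make the weighting concrete, the following sketch (with arbitrary data of our own choosing) evaluates the cost in (5.43) both through the weighted norm with Λ_n from (5.44) and through the sum in (5.45), confirming that the two forms coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L, lam = 20, 3, 0.95
C = rng.standard_normal((n + 1, L))      # rows x^T(0) ... x^T(n)
d = rng.standard_normal(n + 1)
w = rng.standard_normal(L)

# Weights of (5.44): Lambda_n = diag(lam^n, ..., lam, 1).
weights = lam ** np.arange(n, -1, -1)
e = d - C @ w                            # error signals e(0) ... e(n)
cost_matrix_form = e @ np.diag(weights) @ e   # ||d_n - C_n w||^2_{Lambda_n}
cost_sum_form = np.sum(weights * e**2)        # right-hand side of (5.45)
assert np.isclose(cost_matrix_form, cost_sum_form)
```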