Chapter 5
Least Squares
Abstract In this chapter we will cover the basics of the celebrated method of Least
Squares (LS). The approach of this method is different from the stochastic gradient
approach of the previous chapter. As always, the idea will be to obtain an estimate
of a given system using measured input-output pairs (and no statistical information),
assuming a model in which the input and output pairs are linearly related. We will
also present the Recursive Least Squares (RLS) algorithm, which is a recursive
and more computationally efficient implementation of the LS method. One of its
advantages is that it can be used in real time, as the input-output pairs are received.
In this sense, it is very similar to the adaptive filters obtained in the previous
chapter. Several important properties of LS and RLS will be discussed.
5.1 The Least Squares (LS) Problem
Assume that we have the following quantities:
{d(0), d(1), . . . , d(n − 1)},   {x(0), x(1), . . . , x(n − 1)},   (5.1)
where x(i) ∈ R^L, ∀i = 0, 1, . . . , n − 1. The values d(i) and the vectors x(i)¹ can be
taken as the output and input of a given physical system. As discussed in Sect.
2.3, even though the actual relationship between the input and output pairs might not
be linear, we can always write:
d = C w_T + v,   (5.2)
1 Notice that at this moment of the presentation we are not assuming that the vectors x(i) are
generated from a scalar signal x(i) through a tapped-delay line structure.
where:
d = [d(0) d(1) . . . d(n − 1)]^T   and   C = [x(0) x(1) . . . x(n − 1)]^T,   (5.3)
that is, the i-th row of C is x^T(i).
Notice that d ∈ R^n and C ∈ R^{n×L}. As in the linear regression model (2.11), the
vector w_T represents the true system when it exists and is linear, or a first linear
approximation to the general input-output relation of the given data pairs. The vector
v could represent modeling errors, additive noise perturbations, etc. The question is
how to obtain w_T based on the information given by d and C. As we do not have
any statistical information about the perturbation v, it seems natural to search for ŵ
such that:
ŵ = arg min_{w∈R^L} ‖d − Cw‖²,   (5.4)
where ‖ · ‖ denotes the usual canonical Euclidean norm [1]. The criterion in (5.4) is
known as the Least Squares problem and is satisfactory for two reasons. Firstly, as
we are assuming that there is no statistical information in our problem, it makes sense
to adopt a purely deterministic criterion of optimization, using only the input-output
pairs given. The idea is to find a vector wˆ that explains the data given in the best
manner possible using familiar Euclidean distances. Secondly, observe that in the
special case that the input-output pairs come from jointly ergodic processes [2], we
can write:
(1/n) ‖d − Cw‖² = (1/n) Σ_{i=0}^{n−1} |d(i) − x^T(i)w|² −→ E[|d(i) − x^T(i)w|²]  as n → ∞.   (5.5)
So, if n is sufficiently large and the input-output pairs come from ergodic processes,
the solution to problem (5.4)² coincides with the Wiener solution from Chap. 2.
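As a quick numerical illustration of (5.5), the following sketch (with dimensions, noise level and variable names of our own choosing, assuming i.i.d. Gaussian inputs and noise) compares the normalized LS cost with the corresponding mean square error for a fixed candidate w:

```python
import numpy as np

# Illustration of (5.5): for ergodic data the normalized LS cost
# approaches the mean square error E[|d(i) - x^T(i) w|^2].
rng = np.random.default_rng(0)
L, n = 4, 100_000
w_true = rng.standard_normal(L)

X = rng.standard_normal((n, L))          # rows are x^T(i)
v = 0.1 * rng.standard_normal(n)         # zero-mean i.i.d. noise
d = X @ w_true + v

w = rng.standard_normal(L)               # any fixed candidate w
empirical = np.mean((d - X @ w) ** 2)    # (1/n) ||d - Cw||^2
# For this model the true MSE is ||w_true - w||^2 + sigma_v^2.
theoretical = np.sum((w_true - w) ** 2) + 0.1 ** 2
print(empirical, theoretical)            # the two values nearly coincide
```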
5.1.1 Solution to the LS Problem
In order to solve the problem in (5.4) we write
‖d − Cw‖² = ‖d‖² − 2w^T C^T d + w^T C^T Cw.   (5.6)
Differentiating (5.6) with respect to w and setting the result equal to zero, we obtain:
C^T Cw = C^T d.   (5.7)
2 The reader should notice that the factor 1/n does not modify the solution in (5.4).
Equation (5.7) is known as the normal equation. It can be shown that the normal
equation always has a solution.³ Depending on the dimensions of C, (5.7) can have
one or infinitely many solutions:
• n ≥ L: As C has more rows than columns, the number of equations in (5.2)
is greater than the number of unknowns in w. The system in (5.2) is said to be
overdetermined. In this case there will be one or an infinite number of
solutions to (5.7), depending on the column rank of C. If C has full column rank
the solution is unique. If it is column rank deficient there exists an infinite number
of solutions.
• n < L: In this case C has more columns than rows. That is, in (5.2) there are more
unknowns than equations. The system in (5.2) is said to be underdetermined.
In this case, the number of solutions to (5.7) is always infinite, because
C is necessarily column rank deficient.
The most common situation, at least in adaptive filtering,⁴ will be the first one, in
which n ≥ L. Suppose then that C has full column rank. It can be easily shown that
C^T C ∈ R^{L×L} has full rank and is therefore invertible. The solution to (5.7)
is then unique and:
ŵ = (C^T C)^{−1} C^T d.   (5.8)
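As a minimal sketch (with hypothetical data of our own making), the solution (5.8) can be computed with NumPy; in practice one prefers np.linalg.lstsq, which solves (5.4) through an orthogonal factorization instead of forming C^T C explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 200, 5
C = rng.standard_normal((n, L))          # full column rank w.p. 1
w_true = rng.standard_normal(L)
d = C @ w_true + 0.05 * rng.standard_normal(n)

# Normal-equation solution (5.8); solve() avoids an explicit inverse.
w_hat = np.linalg.solve(C.T @ C, C.T @ d)

# Numerically preferred equivalent: SVD-based least squares.
w_lstsq, *_ = np.linalg.lstsq(C, d, rcond=None)
print(np.allclose(w_hat, w_lstsq))       # True
```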
When C does not have full column rank, the inverse of C^T C does not exist. In this case,
the number of solutions to (5.7) is infinite. As C^T C is singular, any
solution to (5.7) can be written as:
ŵ = ŵ_p + ŵ_h,   (5.9)
where ŵ_p is any particular solution to (5.7) and ŵ_h is any solution to:
C^T Cŵ_h = 0,   (5.10)
that is, ŵ_h is any vector in N(C^T C),⁵ the null space of C^T C, which can easily
be seen to be equal to N(C). It is also clear that the difference between two solutions
ŵ₁ and ŵ₂ is a vector belonging to N(C).
A natural question when (5.7) has multiple solutions is which one would be the
best choice. There is no simple answer to that problem. A common choice is
the following:
ŵ = arg min_{w∈R^L} ‖w‖²   subject to   C^T Cw = C^T d.   (5.11)
3 Notice that R(C^T C) = R(C^T) [3], where R(A) denotes the range or column space of the
matrix A. This implies that there exists at least one w which satisfies Eq. (5.7).
4 In recent years there has been growing interest in underdetermined systems with sparsity conditions
on the solution. This is commonly known as compressed sensing or compressive sampling. The
interested reader can see [4] and the references therein.
5 As C^T C is singular, N(C^T C) contains other vectors besides the null vector.
This means that we would choose the solution of (5.7) with minimum canonical
Euclidean norm. Then, given a particular solution ŵ_p, we need to search over the
vectors ŵ_h ∈ N(C) that lead to ŵ = ŵ_p + ŵ_h having minimum norm. Although
this seems difficult, the fact is that the solution to problem (5.11) is very simple
and is given by
ŵ = C†d,   (5.12)
where C† denotes the Moore-Penrose pseudoinverse of C. The pseudoinverse of a
general rectangular matrix C is a generalization of the concept of inverse. Its calculation
in the most general case can be obtained from the singular value decomposition
(SVD) of the matrix C [5]. The SVD is a very powerful technique to decompose a
general rectangular matrix.
5.1.1.1 Singular Value Decomposition and Pseudoinverse
Let C ∈ R^{n×L} be an arbitrary rectangular matrix. Matrix C can be written as:
• If L ≤ n:
C = U [Σ; 0] V^T,   Σ ∈ R^{L×L},   (5.13)
where [A; B] denotes the vertical stacking of the blocks A and B.
• If L ≥ n:
C = U [Σ 0] V^T,   Σ ∈ R^{n×n},   (5.14)
with U ∈ R^{n×n} and V ∈ R^{L×L} being orthonormal matrices⁶ whose columns are
the eigenvectors of CC^T and C^T C respectively. The square matrix Σ is diagonal with
nonnegative entries, that is:
Σ = diag(σ_1, σ_2, . . . , σ_K, 0, . . . , 0),   (5.15)
where σ_i², i = 1, . . . , K, with K = rank(C) ≤ min(n, L), are the non-null
eigenvalues of either CC^T or C^T C.
The SVD is a very important tool in linear algebra, matrix analysis and signal
processing. The reader interested in more details on the SVD can see [5, 6] and the
references therein. The pseudoinverse C† can be written in terms of the SVD of C as [7]:
• If L ≤ n:
C† = V [Σ^{−1} 0] U^T.   (5.16)
• If L ≥ n:
C† = V [Σ^{−1}; 0] U^T.   (5.17)
6 U^T U = UU^T = I_n and V^T V = VV^T = I_L.
When C has full column rank, C†C = I_L, and when it has full row rank, CC† = I_n.
When L ≤ n and C has full column rank, it is easy to show that:
C† = (C^T C)^{−1} C^T,   (5.18)
which, when replaced in (5.12), leads to the unique solution (5.8).
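The following sketch (assuming a full column rank C of our own making) computes C† from the SVD as in (5.16), and checks it against both (5.18) and NumPy's np.linalg.pinv; the product C†d then gives the minimum-norm solution (5.12):

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 200, 5
C = rng.standard_normal((n, L))
d = rng.standard_normal(n)

# Pseudoinverse from the SVD, as in (5.16): invert only the
# non-null singular values (here C has full column rank).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Agrees with (5.18) when C has full column rank ...
assert np.allclose(C_pinv, np.linalg.inv(C.T @ C) @ C.T)
# ... and with NumPy's built-in pseudoinverse.
assert np.allclose(C_pinv, np.linalg.pinv(C))

w_min_norm = C_pinv @ d                  # minimum-norm solution (5.12)
```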
5.2 Orthogonal Projection Interpretation
The theory of LS is a very elegant one. Behind the LS problem is the concept of
orthogonal projection as in the Wiener filtering case (see Sect. 2.3). In the following,
we will show that the solution to the LS problem can be interpreted as a Euclidean
orthogonal projection onto an appropriate subspace.
Let us call S(C) the subspace generated by the columns of C. From (5.4) it is
clear that the quantity d̂ ≜ Cw, for any vector w ∈ R^L (even the optimal one ŵ),
can be written as:
d̂ = Σ_{i=1}^{L} w_i u(i),   (5.19)
where u(i), i = 1, . . . , L, are the columns of C. As the right-hand side of (5.19) is
a linear combination of the columns of C, we can conclude that
d̂ ∈ S(C).   (5.20)
This means that for every choice of w ∈ R^L, the value of d̂ necessarily belongs
to S(C). In fact, problem (5.4) is equivalent to searching for the vector d̂ ∈ S(C) which
is closest to d in a Euclidean distance sense. The vector ŵ can be interpreted as the
coordinates of the optimal d̂ in S(C).
The problem of approximating a given vector with another vector in a given
subspace is a very well known problem in mathematics [8]. For a complete inner product
space (that is, a Hilbert space), it is known that the solution to that problem is an
orthogonal projection. In precise terms, and for our problem where the space is R^n
with the usual canonical Euclidean inner product, this means:
u^T(i)e = 0,   i = 1, . . . , L,   (5.21)
where e = d − Cŵ is the error between the original d and the optimal approximation
Cŵ ∈ S(C). Equation (5.21) means that the approximation error for the optimal
solution has to be orthogonal to the columns of matrix C, which is equivalent to
being orthogonal to the entire subspace S(C). Let us define d̂_opt ≜ Cŵ. In Fig. 5.1
we depict the geometrical properties of the orthogonal projection onto S(C).
[Fig. 5.1 Orthogonal projection onto S(C)]
It is easy to check that the optimal d̂_opt ∈ S(C) is unique. This means that it is
the unique vector that satisfies (5.21), which can be put more compactly as:
C^T e = 0.   (5.22)
It is straightforward to see that the previous expression can be written as:
C^T Cŵ = C^T d,   (5.23)
which is the normal equation in (5.7). Assuming that C has full column rank, the
solution for the optimal ŵ is given by (5.8) and the value of d̂_opt is given by:
d̂_opt = C (C^T C)^{−1} C^T d.   (5.24)
The n × n matrix:
P[S(C)] = C (C^T C)^{−1} C^T   (5.25)
is called the projection matrix, and gives a representation for the operator that computes
the unique orthogonal projection of any vector in R^n onto S(C). Projection matrices
have important properties [3]:
• P[S(C)] = P^T[S(C)].
• P²[S(C)] = P[S(C)].
• P[S⊥(C)] = I_n − P[S(C)].
The second property says that a projection matrix is an idempotent operator. That is,
applying the operator twice or more times is equivalent to applying it just once. This
is equivalent to saying that if d ∈ S (C), then P [S (C)] d = d, which gives e = 0.
The third property says that the projector onto the orthogonal complement of S(C),
denoted as S⊥(C), can be calculated in a very simple manner using the projector
onto S(C). In fact, it is easy to show that for every d ∈ R^n, we can write:
d = P[S(C)] d + P[S⊥(C)] d.   (5.26)
It follows immediately that:
e = (I_n − P[S(C)]) d.   (5.27)
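The properties above are easy to verify numerically; the following sketch (with an arbitrary full column rank C of our own choosing) checks the symmetry and idempotence of (5.25) together with (5.22), (5.26) and (5.27):

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 8, 3
C = rng.standard_normal((n, L))
d = rng.standard_normal(n)

P = C @ np.linalg.inv(C.T @ C) @ C.T     # projector (5.25)

assert np.allclose(P, P.T)               # symmetry
assert np.allclose(P @ P, P)             # idempotence
e = (np.eye(n) - P) @ d                  # error vector (5.27)
assert np.allclose(C.T @ e, 0)           # orthogonality (5.22)
assert np.allclose(P @ d + e, d)         # decomposition (5.26)
```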
As a final remark we would like to point out that the uniqueness of d̂_opt does not
mean that ŵ is unique. In fact, an infinite number of vectors ŵ giving the same d̂_opt are
possible. The vector ŵ will be unique just when there is only one way of writing
d̂_opt in terms of u(i), i = 1, . . . , L. That will be the case when the u(i) are linearly
independent, which is equivalent to the full column rank condition on C. This is
consistent with (5.8).
5.3 Weighted Least Squares (WLS)
In the previous section we showed that the LS problem formulated in (5.4) is basically
an orthogonal projection problem in the Euclidean space R^n with the canonical inner
product. Many inner products can be defined on R^n, and a natural question is
how the LS solution generalizes when we use a non-canonical inner product.
A general inner product on R^n can be written as:
⟨a, b⟩_M = a^T Mb,   (5.28)
where M ∈ R^{n×n} is a symmetric positive definite matrix. The canonical inner product
corresponds to M = I_n. From the inner product we can also define a norm:
‖a‖²_M = a^T Ma.   (5.29)
The use of a generalized norm allows us to weight the importance of the entries of the vector
d and the matrix C. In this manner, and for a given symmetric positive definite matrix
M, the WLS problem can be written as:
ŵ = arg min_{w∈R^L} ‖d − Cw‖²_M.   (5.30)
As in the LS problem, this is an orthogonal projection problem with the inner product
given by (5.28). Extending (5.22), it holds that:
C^T Me = 0,   (5.31)
where e = d − Cŵ is the error vector at the optimal ŵ. In a straightforward manner
we can obtain:
C^T MCŵ = C^T Md,   (5.32)
which is the normal equation for this problem. At this point, the same considerations
made for the LS problem apply here. To keep things simple we will
assume that M^{1/2}C is a full column rank matrix.⁷ In that case C^T MC is invertible
and the optimal ŵ can be obtained as:
ŵ = (C^T MC)^{−1} C^T Md.   (5.33)
The unique optimal d̂_opt can be written as:
d̂_opt = C (C^T MC)^{−1} C^T Md,   (5.34)
which means that the projection matrix P_M[S(C)] can be put as:
P_M[S(C)] = C (C^T MC)^{−1} C^T M.   (5.35)
Except for the symmetry condition, P_M[S(C)] has the properties mentioned previously
for P[S(C)].
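As a brief sketch (the diagonal weighting M below is an arbitrary choice of ours), the WLS solution (5.33) and the projector (5.35) can be computed as follows; note that P_M is idempotent but, in general, no longer symmetric:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 100, 4
C = rng.standard_normal((n, L))
d = C @ rng.standard_normal(L) + rng.standard_normal(n)

# Example weighting: emphasize later samples (M diagonal, pos. def.).
M = np.diag(np.linspace(0.1, 1.0, n))

# WLS solution (5.33) via the weighted normal equations (5.32).
w_wls = np.linalg.solve(C.T @ M @ C, C.T @ M @ d)

# Projector (5.35): idempotent, but not symmetric in general.
P_M = C @ np.linalg.solve(C.T @ M @ C, C.T @ M)
assert np.allclose(P_M @ P_M, P_M)
```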
5.4 Some Statistical Properties of the LS Solution
The appealing nature of the LS formulation (5.4) is reinforced by some interesting
and useful properties. In this section we will mention some of them without proof.⁸
The interested reader can see [7] and [9] for a more detailed analysis. Consider (5.2),
rewritten below for ease of reference:
d = C w_T + v.   (5.36)
Let us consider v = [v(0) v(1) . . . v(n − 1)]^T, where v(i) is a sequence of zero-mean
i.i.d. random variables with E[v²(i)] = σ_v². Although the matrix C could be
composed of observations of a random process, it will be assumed fixed and perfectly
known. That is, the only randomness in the model is due to v(i). The LS solution ŵ
can be thought of as an estimate of w_T. The following holds:
1. The LS solution ŵ is an unbiased estimator. That is:
7 Notice that as M is a symmetric positive definite matrix, its square root is well defined [3].
8 We will restrict ourselves to the regular LS. Similar properties for the more general WLS also
exist.
E[ŵ] = w_T,   (5.37)
where the expectation is with respect to the distribution of the noise v.
2. The LS solution is the best linear unbiased estimator (BLUE) of w_T when v(i)
is a sequence of zero-mean i.i.d. random variables. That is:
K_LS ≤ K,   (5.38)
where, assuming that C is full column rank,
K_LS = E[(ŵ − w_T)(ŵ − w_T)^T] = σ_v² (C^T C)^{−1}   (5.39)
is the covariance matrix of the LS estimator. The matrix K is the covariance matrix
of any other linear unbiased estimator (i.e., w̌ = Ad, with A ∈ R^{L×n} and
AC = I_L). Equation (5.38) means that the covariance matrix of the LS estimator
is less positive definite⁹ than the covariance matrix of any other linear unbiased
estimator of w_T. This implies:
E[‖ŵ − w_T‖²] = tr(K_LS) ≤ tr(K) = E[‖w̌ − w_T‖²],   (5.40)
which means that the LS estimator has a lower dispersion or variance than any
other linear unbiased estimator. This is a very important property, because it
means that when we restrict ourselves to simple linear estimators of w_T (and
according to a data model as in (5.36)), the LS solution is the one with the lowest
variance.
3. When the noise sequence v(i) is zero-mean i.i.d. and Gaussian distributed, the
LS estimator ŵ is not only the best linear unbiased estimator, but the best of all
unbiased estimators, even nonlinear ones. In more precise terms, the covariance
matrix K_LS attains the Cramér-Rao lower bound¹⁰ on the covariance matrix of
any unbiased estimator of w_T. In terms of mean square error, there is no better
unbiased estimator than the LS solution. The reader should be cautious, as this result
depends heavily on the Gaussian assumption on v(i). If the noise is not Gaussian,
this result might not be true. However, property 2 is always true.
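A simple Monte Carlo experiment (dimensions and noise level chosen arbitrarily) illustrates properties 1 and 2: averaging the LS estimates over many noise realizations recovers w_T, and their empirical covariance approaches (5.39):

```python
import numpy as np

rng = np.random.default_rng(5)
n, L, sigma_v, trials = 50, 3, 0.5, 20_000
C = rng.standard_normal((n, L))          # fixed and known, as assumed
w_T = rng.standard_normal(L)

A = np.linalg.inv(C.T @ C) @ C.T         # LS estimator matrix
W = np.empty((trials, L))
for t in range(trials):
    v = sigma_v * rng.standard_normal(n) # zero-mean i.i.d. Gaussian noise
    W[t] = A @ (C @ w_T + v)             # LS estimate for this realization

print(W.mean(axis=0) - w_T)              # ~0: unbiasedness (5.37)
K_emp = np.cov(W.T)                      # empirical covariance
K_LS = sigma_v**2 * np.linalg.inv(C.T @ C)   # theoretical value (5.39)
print(np.max(np.abs(K_emp - K_LS)))      # small sampling error
```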
9 Similarly to Eq. (4.93), we consider the usual partial ordering on the symmetric positive definite
matrices [3].
10 The Cramér-Rao lower bound is a general lower bound on the variance of any unbiased estimator
of a given parameter. It depends on the pdf of the random variables that influence a given observation
linked to the parameter to be estimated. As it is an absolute lower bound, obtaining an estimator that
is able to attain it is very relevant. When such an estimator exists, it is said to be an efficient estimator.
In the majority of cases (especially when the random variables involved are not Gaussian), it is
not possible to obtain an efficient estimator. For more details on the Cramér-Rao lower bound the
interested reader can see [9] and [10].
5.5 Recursive Least Squares (RLS)
Consider Eq. (5.8), which will be rewritten as:
w(n − 1) = (C^T_{n−1} C_{n−1})^{−1} C^T_{n−1} d_{n−1}.   (5.41)
In (5.41) we have explicitly denoted that the quantities in (5.3) are generated using the n
input-output pairs available up to time n − 1.¹¹ In practice it is interesting to deal with
the situation in which n increases. In some situations, it would be desirable to obtain
an LS solution for a given number of input-output pairs without waiting to have
them all. Each time a new pair is available, a new LS solution taking into account
the new input-output pair would be desirable. This would be the case when
input-output pairs are obtained in real time, arriving in a sequential manner. The
calculation of the new LS solution would imply the implementation of (5.41) with
C_n and d_n. The main problem with this is the need to obtain the inverse of C^T_n C_n. As
this matrix is L × L, the computational cost of obtaining its inverse is O(L³). If
the dimension L is large, the cost of inverting an L × L matrix for every new
input-output pair received can be overwhelming. On
the other hand, we see that the vector d_n and the matrix C_n can be written as:
d_n = [d_{n−1}; d(n)],   C_n = [C_{n−1}; x^T(n)].   (5.42)
These relations show that there is a significant overlap of information between C_n
and C_{n−1}. The natural question is: can this structure in C_n and d_n be exploited in
some smart way in order to reduce the aforementioned computational requirement?
The answer is affirmative, as we will show next.
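The key to the affirmative answer is the rank-one structure implied by (5.42): C^T_n C_n = C^T_{n−1} C_{n−1} + x(n)x^T(n) and C^T_n d_n = C^T_{n−1} d_{n−1} + x(n)d(n). The following sketch (with arbitrary data of our own) verifies these identities numerically; the matrix inversion lemma then turns them into an O(L²) update of the solution:

```python
import numpy as np

rng = np.random.default_rng(7)
n, L = 10, 4
C_prev = rng.standard_normal((n, L))     # C_{n-1}
d_prev = rng.standard_normal(n)          # d_{n-1}
x_new, d_new = rng.standard_normal(L), rng.standard_normal()

# Stack the new input-output pair as in (5.42).
C = np.vstack([C_prev, x_new])
d = np.append(d_prev, d_new)

# Rank-one update structure implied by (5.42):
assert np.allclose(C.T @ C, C_prev.T @ C_prev + np.outer(x_new, x_new))
assert np.allclose(C.T @ d, C_prev.T @ d_prev + x_new * d_new)
```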
First, we consider the following weighted cost function for the LS problem¹²:
w(n) = arg min_{w∈R^L} ‖d_n − C_n w‖²_{Λ_n},   (5.43)
where Λ_n is an (n + 1) × (n + 1) diagonal matrix given by:
Λ_n = diag(λⁿ, λ^{n−1}, . . . , λ, 1),   0 < λ ≤ 1.   (5.44)
Defining the error signal e(i) = d(i) − x^T(i)w, i = 0, . . . , n, we can write:
11 Notice that, in order to simplify the notation a little, we use w instead of the ŵ that was used
in the previous sections to denote the LS solution.
12 In the following, and without loss of generality, we will concentrate on the RLS algorithm known
as Exponentially Weighted RLS (EWRLS), which generalizes the standard RLS algorithm (with
λ = 1) and is by far the most used and popular version. Another important variant is the sliding
window RLS algorithm [11].
‖d_n − C_n w‖²_{Λ_n} = Σ_{i=0}^{n} λ^{n−i} |e(i)|².   (5.45)
Notice that the cost function in (5.45) presents a forgetting factor given by λ. As time
progresses, the above cost forgets the distant past and gives more importance to the
present and immediate past. This weighting in time is what allows the resulting
algorithm to track variations in the underlying system.
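To make the weighting concrete, the following sketch (with arbitrary data of our own choosing) evaluates the cost in (5.43) both through the weighted norm with Λ_n from (5.44) and through the sum in (5.45), confirming that the two forms coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L, lam = 20, 3, 0.95
C = rng.standard_normal((n + 1, L))      # rows x^T(0) ... x^T(n)
d = rng.standard_normal(n + 1)
w = rng.standard_normal(L)

# Weights of (5.44): Lambda_n = diag(lam^n, ..., lam, 1).
weights = lam ** np.arange(n, -1, -1)
e = d - C @ w                            # error signals e(0) ... e(n)
cost_matrix_form = e @ np.diag(weights) @ e   # ||d_n - C_n w||^2_{Lambda_n}
cost_sum_form = np.sum(weights * e**2)        # right-hand side of (5.45)
assert np.isclose(cost_matrix_form, cost_sum_form)
```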