Linear Algebra Review and Reference
Zico Kolter
October 16, 2007

1 Basic Concepts and Notation

Linear algebra provides a way of compactly representing and operating on sets of linear equations. For example, consider the following system of equations:
$$4x_1 - 5x_2 = -13$$
$$-2x_1 + 3x_2 = 9.$$
This is two equations and two variables, so as you know from high school algebra, you can find a unique solution for $x_1$ and $x_2$ (unless the equations are somehow degenerate, for example if the second equation is simply a multiple of the first; but in the case above there is in fact a unique solution). In matrix notation, we can write the system more compactly as
$$Ax = b \quad \text{with} \quad A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix}.$$
As we will see shortly, there are many advantages (including the obvious space savings) to analyzing linear equations in this form.

1.1 Basic Notation

We use the following notation:

• By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows and $n$ columns, where the entries of $A$ are real numbers.

• By $x \in \mathbb{R}^n$ we denote a vector with $n$ entries. Usually a vector $x$ will denote a column vector, i.e., a matrix with $n$ rows and 1 column. If we want to explicitly represent a row vector (a matrix with 1 row and $n$ columns) we typically write $x^T$ (here $x^T$ denotes the transpose of $x$, which we will define shortly).

• The $i$th element of a vector $x$ is denoted $x_i$:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

• We use the notation $a_{ij}$ (or $A_{ij}$, $A_{i,j}$, etc.) to denote the entry of $A$ in the $i$th row and $j$th column:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$

• We denote the $j$th column of $A$ by $a_j$ or $A_{:,j}$:
$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}.$$

• We denote the $i$th row of $A$ by $a_i^T$ or $A_{i,:}$:
$$A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}.$$

• Note that these definitions are ambiguous (for example, the $a_1$ and $a_1^T$ in the previous two definitions are not the same vector). Usually the meaning of the notation should be obvious from its use.

2 Matrix Multiplication

The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix $C = AB \in \mathbb{R}^{m \times p}$, where
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$
Note that in order for the matrix product to exist, the number of columns in $A$ must equal the number of rows in $B$. There are many ways of looking at matrix multiplication, and we'll start by examining a few special cases.

2.1 Vector-Vector Products

Given two vectors $x, y \in \mathbb{R}^n$, the quantity $x^T y$, sometimes called the inner product or dot product of the vectors, is a real number given by
$$x^T y \in \mathbb{R} = \sum_{i=1}^{n} x_i y_i.$$
Note that it is always the case that $x^T y = y^T x$.

Given vectors $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$ (they no longer have to be the same size), $xy^T$ is called the outer product of the vectors. It is a matrix whose entries are given by $(xy^T)_{ij} = x_i y_j$, i.e.,
$$xy^T \in \mathbb{R}^{m \times n} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.$$

2.2 Matrix-Vector Products

Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$, their product is a vector $y = Ax \in \mathbb{R}^m$. There are a couple of ways of looking at matrix-vector multiplication, and we will look at them both.

If we write $A$ by rows, then we can express $Ax$ as
$$y = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}.$$
In other words, the $i$th entry of $y$ is equal to the inner product of the $i$th row of $A$ and $x$, $y_i = a_i^T x$.

Alternatively, let's write $A$ in column form. In this case we see that
$$y = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n.$$
In other words, $y$ is a linear combination of the columns of $A$, where the coefficients of the linear combination are given by the entries of $x$.

So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a row vector. This is written $y^T = x^T A$ for $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^m$, and $y \in \mathbb{R}^n$. As before, we can express $y^T$ in two obvious ways, depending on whether we express $A$ in terms of its rows or its columns. In the first case we express $A$ in terms of its columns, which gives
$$y^T = x^T \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix},$$
which demonstrates that the $i$th entry of $y^T$ is equal to the inner product of $x$ and the $i$th column of $A$.

Finally, expressing $A$ in terms of its rows, we get the final representation of the vector-matrix product,
$$y^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} = x_1 a_1^T + x_2 a_2^T + \cdots + x_m a_m^T,$$
so we see that $y^T$ is a linear combination of the rows of $A$, where the coefficients of the linear combination are given by the entries of $x$.
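To make these viewpoints concrete, here is a minimal NumPy sketch (an addition, not part of the original notes) that solves the small system from Section 1 and checks that the row view and the column view of the matrix-vector product agree with the library routine. The variable names are illustrative only.

```python
import numpy as np

# The 2x2 system from Section 1:  4*x1 - 5*x2 = -13,  -2*x1 + 3*x2 = 9
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

x = np.linalg.solve(A, b)   # unique solution since the equations are not degenerate
print(x)                    # [3. 5.]

# Matrix-vector product y = Ax, computed three equivalent ways
y_builtin = A @ x
y_rows = np.array([A[i, :] @ x for i in range(A.shape[0])])      # inner product with each row
y_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))          # linear combination of columns

assert np.allclose(y_builtin, y_rows) and np.allclose(y_builtin, y_cols)
assert np.allclose(y_builtin, b)      # by construction, Ax = b
```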
2.3 Matrix-Matrix Products

Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication $C = AB$ as defined at the beginning of this section.

First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the $(i,j)$ entry of $C$ is equal to the inner product of the $i$th row of $A$ and the $j$th column of $B$. Symbolically, this looks like the following:
$$C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}.$$
Remember that since $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, $a_i \in \mathbb{R}^n$ and $b_j \in \mathbb{R}^n$, so these inner products all make sense. This is the most "natural" representation when we represent $A$ by rows and $B$ by columns.

Alternatively, we can represent $A$ by columns and $B$ by rows, which leads to the interpretation of $AB$ as a sum of outer products. Symbolically,
$$C = AB = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_n^T \end{bmatrix} = \sum_{i=1}^{n} a_i b_i^T.$$
Put another way, $AB$ is equal to the sum, over all $i$, of the outer product of the $i$th column of $A$ and the $i$th row of $B$. Since, in this case, $a_i \in \mathbb{R}^m$ and $b_i \in \mathbb{R}^p$, the dimension of the outer product $a_i b_i^T$ is $m \times p$, which coincides with the dimension of $C$.

Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent $B$ by columns, we can view the columns of $C$ as matrix-vector products between $A$ and the columns of $B$. Symbolically,
$$C = AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}.$$
Here the $i$th column of $C$ is given by the matrix-vector product with the vector on the right, $c_i = Ab_i$. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection.

Finally, we have the analogous viewpoint, where we represent $A$ by rows, and view the rows of $C$ as the matrix-vector products between the rows of $A$ and $B$. Symbolically,
$$C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} B = \begin{bmatrix} a_1^T B \\ a_2^T B \\ \vdots \\ a_m^T B \end{bmatrix}.$$
Here the $i$th row of $C$ is given by the matrix-vector product with the vector on the left, $c_i^T = a_i^T B$.
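As an illustration (again a sketch added here, not part of the original notes), the following NumPy snippet verifies all four viewpoints of $C = AB$ on random matrices; the dimensions $m$, $n$, $p$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

C = A @ B

# Viewpoint 1: C[i, j] is the inner product of row i of A and column j of B
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(p)] for i in range(m)])

# Viewpoint 2: C is a sum of outer products of columns of A with rows of B
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

# Viewpoint 3: the columns of C are A times the columns of B
C_cols = np.column_stack([A @ B[:, j] for j in range(p)])

# Viewpoint 4: the rows of C are the rows of A times B
C_rows = np.vstack([A[i, :] @ B for i in range(m)])

for variant in (C_inner, C_outer, C_cols, C_rows):
    assert np.allclose(C, variant)
```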
It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here.

In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:

• Matrix multiplication is associative: $(AB)C = A(BC)$.
• Matrix multiplication is distributive: $A(B + C) = AB + AC$.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that $AB \neq BA$.

3 Operations and Properties

In this section we present several operations and properties of matrices and vectors. Hopefully a great deal of this will be review for you, so these notes can just serve as a reference for these topics.

3.1 The Identity Matrix and Diagonal Matrices

The identity matrix, denoted $I \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else. That is,
$$I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$
It has the property that for all $A \in \mathbb{R}^{m \times n}$,
$$AI = A = IA,$$
where the size of $I$ is determined by the dimensions of $A$ so that matrix multiplication is possible.

A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, with
$$D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j. \end{cases}$$
Clearly, $I = \mathrm{diag}(1, 1, \ldots, 1)$.

3.2 The Transpose

The transpose of a matrix results from "flipping" the rows and columns. Given a matrix $A \in \mathbb{R}^{m \times n}$, its transpose, written $A^T$, is defined as
$$A^T \in \mathbb{R}^{n \times m}, \quad (A^T)_{ij} = A_{ji}.$$
We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.

The following properties of transposes are easily verified:
• $(A^T)^T = A$
• $(AB)^T = B^T A^T$
• $(A + B)^T = A^T + B^T$

3.3 Symmetric Matrices

A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$. It is anti-symmetric if $A = -A^T$. It is easy to show that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric and the matrix $A - A^T$ is anti-symmetric. From this it follows that any square matrix $A \in \mathbb{R}^{n \times n}$ can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since
$$A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)$$
and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ means that $A$ is a symmetric $n \times n$ matrix.

3.4 The Trace

The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$ if the parentheses are obviously implied), is the sum of the diagonal elements of the matrix:
$$\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}.$$
As described in the CS229 lecture notes, the trace has the following properties (included here for the sake of completeness):
• For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$.
• For $A, B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$.
• For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t\,\mathrm{tr}A$.
• For $A, B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$.
• For $A, B, C$ such that $ABC$ is square, $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of more matrices.
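The symmetric/anti-symmetric decomposition and the cyclic trace property above are easy to sanity-check numerically. The following is a small sketch (an addition, not from the original notes) with arbitrary matrix sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# Decomposition into symmetric and anti-symmetric parts: A = S + K
S = 0.5 * (A + A.T)        # symmetric:      S == S.T
K = 0.5 * (A - A.T)        # anti-symmetric: K == -K.T
assert np.allclose(S, S.T) and np.allclose(K, -K.T) and np.allclose(A, S + K)

# Cyclic property of the trace: tr(ABC) = tr(BCA) = tr(CAB)
B = rng.standard_normal((4, 6))
C = rng.standard_normal((6, 4))
t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
assert np.isclose(t1, t2) and np.isclose(t1, t3)
```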
3.5 Norms

A norm of a vector, $\|x\|$, is informally a measure of the "length" of the vector. For example, we have the commonly used Euclidean or $\ell_2$ norm,
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$
Note that $\|x\|_2^2 = x^T x$.

More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies four properties:
1. For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity).
2. $f(x) = 0$ if and only if $x = 0$ (definiteness).
3. For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t| f(x)$ (homogeneity).
4. For all $x, y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ (triangle inequality).

Other examples of norms are the $\ell_1$ norm,
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|,$$
and the $\ell_\infty$ norm,
$$\|x\|_\infty = \max_i |x_i|.$$
In fact, all three norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \geq 1$ and defined as
$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$
Norms can also be defined for matrices, such as the Frobenius norm,
$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}.$$
Many other norms exist, but they are beyond the scope of this review.

3.6 Linear Independence and Rank

A set of vectors $\{x_1, x_2, \ldots, x_n\}$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be (linearly) dependent. For example, if
$$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$$
for some $\{\alpha_1, \ldots, \alpha_{n-1}\}$, then $x_n$ is dependent on $\{x_1, \ldots, x_{n-1}\}$; otherwise, it is independent of $\{x_1, \ldots, x_{n-1}\}$.

The column rank of a matrix $A$ is the largest number of columns of $A$ that constitute a linearly independent set. This is often referred to simply as the number of linearly independent columns, but this terminology is a little sloppy, since it is possible that any vector in some set $\{x_1, \ldots, x_n\}$ can be expressed as a linear combination of the remaining vectors, even though some subset of the vectors might be independent. In the same way, the row rank is the largest number of rows of $A$ that constitute a linearly independent set.

It is a basic fact of linear algebra that for any matrix $A$, $\mathrm{columnrank}(A) = \mathrm{rowrank}(A)$, and so this quantity is simply referred to as the rank of $A$, denoted $\mathrm{rank}(A)$. The following are some basic properties of the rank:
• For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, then $A$ is said to be full rank.
• For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) = \mathrm{rank}(A^T)$.
• For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$.
• For $A, B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B) \leq \mathrm{rank}(A) + \mathrm{rank}(B)$.

3.7 The Inverse

The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$, and is the unique matrix such that
$$A^{-1} A = I = A A^{-1}.$$
It turns out that $A^{-1}$ may not exist for some matrices $A$; we say $A$ is invertible or non-singular if $A^{-1}$ exists, and non-invertible or singular otherwise. One condition for invertibility we already know: it is possible to show that $A^{-1}$ exists if and only if $A$ is full rank. We will soon see that there are many alternative sufficient and necessary conditions, in addition to full rank, for invertibility. The following are properties of the inverse; all assume that $A, B \in \mathbb{R}^{n \times n}$ are non-singular:
• $(A^{-1})^{-1} = A$
• If $Ax = b$, we can multiply by $A^{-1}$ on both sides to obtain $x = A^{-1} b$. This demonstrates the inverse with respect to the original system of linear equations we began this review with.
• $(AB)^{-1} = B^{-1} A^{-1}$
• $(A^{-1})^T = (A^T)^{-1}$. For this reason this matrix is often denoted $A^{-T}$.
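The quantities defined in Sections 3.5 through 3.7 can all be computed directly; the following NumPy sketch (an addition, not from the original notes) does so for a small example, with illustrative values only:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# lp norms from Section 3.5
print(np.linalg.norm(x, 2))        # Euclidean norm: 5.0
print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: 4.0

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
print(np.linalg.norm(A, 'fro'))    # Frobenius norm

# Rank (Section 3.6): A is full rank, so it is invertible (Section 3.7)
print(np.linalg.matrix_rank(A))    # 2

A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, np.eye(2)) and np.allclose(A @ A_inv, np.eye(2))

# x = A^{-1} b recovers the solution of the system from Section 1
b = np.array([-13.0, 9.0])
print(A_inv @ b)                   # [3. 5.]
```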
3.8 Orthogonal Matrices

Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x^T y = 0$. A vector $x \in \mathbb{R}^n$ is normalized if $\|x\|_2 = 1$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthogonal (note the different meanings when talking about vectors versus matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).

It follows immediately from the definition of orthogonality and normality that
$$U^T U = I = U U^T.$$
In other words, the inverse of an orthogonal matrix is its transpose. Note that if $U$ is not square (i.e., $U \in \mathbb{R}^{m \times n}$ with $n < m$) but its columns are still orthonormal, then $U^T U = I$, but $U U^T \neq I$. We generally only use the term orthogonal to describe the previous case, where $U$ is square.

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e.,
$$\|Ux\|_2 = \|x\|_2$$
for any $x \in \mathbb{R}^n$ and orthogonal $U \in \mathbb{R}^{n \times n}$.

3.9 Range and Nullspace of a Matrix

The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{x_1, \ldots, x_n\}$. That is,
$$\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i, \; \alpha_i \in \mathbb{R} \right\}.$$
It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$ linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{x_1, \ldots, x_n\}) = \mathbb{R}^n$. In other words, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{x_1, \ldots, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{x_1, \ldots, x_n\})$ such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v - y\|_2$. We denote the projection as $\mathrm{Proj}(y; \{x_1, \ldots, x_n\})$ and can define it formally as
$$\mathrm{Proj}(y; \{x_1, \ldots, x_n\}) = \operatorname{argmin}_{v \in \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.$$

The range (sometimes also called the columnspace) of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$. In other words,
$$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = Ax, \; x \in \mathbb{R}^n \}.$$
Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of $A$ is given by
$$\mathrm{Proj}(y; A) = \operatorname{argmin}_{v \in \mathcal{R}(A)} \|v - y\|_2 = A (A^T A)^{-1} A^T y.$$
This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters. Looking at the definition for the projection, it should not be too hard to convince yourself that this is in fact the same objective that we minimized in our least squares problem (except for a squaring of the norm, which doesn't affect the optimal point), and so these problems are naturally very connected. When $A$ contains only a single column, $a \in \mathbb{R}^m$, this gives the special case for the projection of a vector onto a line:
$$\mathrm{Proj}(y; a) = \frac{a a^T}{a^T a} y.$$

The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal 0 when multiplied by $A$, i.e.,
$$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$
Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that
$$\{ w : w = u + v, \; u \in \mathcal{R}(A^T), \; v \in \mathcal{N}(A) \} = \mathbb{R}^n \quad \text{and} \quad \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}.$$
In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint subsets (sharing only the zero vector) that together span the entire space $\mathbb{R}^n$. Sets of this type are called orthogonal complements, and we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^{\perp}$.
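To connect the projection formula to computation, here is a small NumPy sketch (an addition, not part of the original notes) that projects a vector onto the range of a tall full-rank matrix and checks the result against a least-squares solve; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))    # tall matrix, full rank with probability 1
y = rng.standard_normal(m)

# Proj(y; A) = A (A^T A)^{-1} A^T y   (valid when A is full rank and n < m)
P = A @ np.linalg.inv(A.T @ A) @ A.T
proj = P @ y

# The same point via least squares: v* = A x*, with x* = argmin ||A x - y||_2
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(proj, A @ x_star)

# The residual y - proj is orthogonal to the range of A (i.e., to every column)
assert np.allclose(A.T @ (y - proj), 0.0)
```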
3.10 The Determinant

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det : \mathbb{R}^{n \times n} \to \mathbb{R}$, and is denoted $|A|$ or $\det A$ (like the trace operator, we usually omit parentheses). The full formula for the determinant gives little intuition about its meaning, so we instead first give three defining properties of the determinant, from which all the rest follow (including the general formula):

1. The determinant of the identity is 1, $|I| = 1$.
2. Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$:
$$\begin{vmatrix} t a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{vmatrix} = t|A|.$$
3. If we exchange any two rows $a_i^T$ and $a_j^T$ of $A$, then the determinant of the new matrix is $-|A|$; for example,
$$\begin{vmatrix} a_2^T \\ a_1^T \\ \vdots \\ a_n^T \end{vmatrix} = -|A|.$$

These properties, however, also give very little intuition about the nature of the determinant, so we now list several properties that follow from the three properties above:
• For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
• For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
• For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible).
• For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, $|A^{-1}| = 1/|A|$.

Before giving the general definition for the determinant, we define, for $A \in \mathbb{R}^{n \times n}$, $A_{\setminus i, \setminus j} \in \mathbb{R}^{(n-1) \times (n-1)}$ to be the matrix that results from deleting the $i$th row and $j$th column from $A$. The general (recursive) formula for the determinant is
$$|A| = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| \quad \text{(for any } j \in 1, \ldots, n\text{)}$$
$$\phantom{|A|} = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| \quad \text{(for any } i \in 1, \ldots, n\text{)}$$
with the initial case that $|A| = a_{11}$ for $A \in \mathbb{R}^{1 \times 1}$. If we were to expand this formula completely for $A \in \mathbb{R}^{n \times n}$, there would be a total of $n!$ ($n$ factorial) different terms. For this reason, we hardly ever explicitly write the complete equation of the determinant for matrices bigger than $3 \times 3$. However, the equations for determinants of matrices up to size $3 \times 3$ are fairly common, and it is good to know them:
$$|[a_{11}]| = a_{11}$$
$$\left| \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \right| = a_{11} a_{22} - a_{12} a_{21}$$
$$\left| \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \right| = a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} - a_{11} a_{23} a_{32} - a_{12} a_{21} a_{33} - a_{13} a_{22} a_{31}$$

The classical adjoint (often just called the adjoint) of a matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\mathrm{adj}(A)$, and defined as
$$\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \quad (\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\setminus j, \setminus i}|$$
(note the switch in the indices $A_{\setminus j, \setminus i}$). It can be shown that for any nonsingular $A \in \mathbb{R}^{n \times n}$,
$$A^{-1} = \frac{1}{|A|} \mathrm{adj}(A).$$
While this is a nice "explicit" formula for the inverse of a matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse.

3.11 Quadratic Forms and Positive Semidefinite Matrices

Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x \in \mathbb{R}^n$, the scalar value $x^T A x$ is called a quadratic form. Written explicitly, we see that
$$x^T A x = \sum_{i=1}^{n} x_i (Ax)_i = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j.$$
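As a final numerical sketch (an addition, not in the original notes), the snippet below checks the adjoint formula for the inverse on a $2 \times 2$ example and evaluates a quadratic form both with matrix products and with the explicit double sum:

```python
import numpy as np

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])

# 2x2 determinant and classical adjoint: adj(A)_{ij} = (-1)^{i+j} |A_{\j,\i}|
detA = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
adjA = np.array([[ A[1, 1], -A[0, 1]],
                 [-A[1, 0],  A[0, 0]]])
assert np.isclose(detA, np.linalg.det(A))
assert np.allclose(np.linalg.inv(A), adjA / detA)   # A^{-1} = adj(A) / |A|

# Quadratic form x^T A x, evaluated two ways
x = np.array([1.0, 2.0])
q_matrix = x @ A @ x
q_sum = sum(A[i, j] * x[i] * x[j] for i in range(2) for j in range(2))
assert np.isclose(q_matrix, q_sum)
```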