LIBSVM: a Library for Support Vector Machines

Chih-Chung Chang and Chih-Jen Lin*
Last updated: December 27, 2006

*Department of Computer Science, National Taiwan University, Taipei 106, Taiwan (http://www.csie.ntu.edu.tw/~cjlin). E-mail: cjlin@csie.ntu.edu.tw

Abstract

LIBSVM is a library for support vector machines (SVM). Its goal is to help users easily use SVM as a tool. In this document, we present all of its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the necessary information.

1 Introduction

LIBSVM is a library for support vector machines (SVM). Its goal is to let users easily use SVM as a tool. In this document, we present all of its implementation details. For using LIBSVM, the README file included in the package provides the necessary information.

In Section 2, we show the formulations used in LIBSVM: C-support vector classification (C-SVC), ν-support vector classification (ν-SVC), distribution estimation (one-class SVM), ε-support vector regression (ε-SVR), and ν-support vector regression (ν-SVR). We discuss the implementation of solving the quadratic problems in Section 3. Section 4 describes two implementation techniques: shrinking and caching. We also support different penalty parameters for unbalanced data; details are in Section 5. Then Section 6 discusses the implementation of multi-class classification. Parameter selection is important for obtaining good SVM models; LIBSVM provides simple and useful tools, which are discussed in Section 7. Section 8 presents the implementation of probability outputs.

2 Formulations

2.1 C-Support Vector Classification

Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, in two classes, and a vector $y \in R^l$ such that $y_i \in \{1, -1\}$, C-SVC (Cortes and Vapnik, 1995; Vapnik, 1998) solves the following primal problem:

\[
\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l.
\tag{2.1}
\]

Its dual is

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad y^T \alpha = 0, \; 0 \le \alpha_i \le C, \; i = 1, \ldots, l,
\tag{2.2}
\]

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $l \times l$ positive semidefinite matrix with $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\phi$. The decision function is

\[
\operatorname{sgn}\Bigl( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \Bigr).
\]

2.2 ν-Support Vector Classification

ν-support vector classification (Schölkopf et al., 2000) uses a new parameter ν which controls the number of support vectors and training errors. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, in two classes, and a vector $y \in R^l$ such that $y_i \in \{1, -1\}$, the primal form considered is

\[
\min_{w, b, \xi, \rho} \quad \frac{1}{2} w^T w - \nu\rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge \rho - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l, \; \rho \ge 0.
\]

The dual is

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le 1/l, \; i = 1, \ldots, l, \; e^T \alpha \ge \nu, \; y^T \alpha = 0,
\tag{2.3}
\]

where $Q_{ij} \equiv y_i y_j K(x_i, x_j)$. The decision function is

\[
\operatorname{sgn}\Bigl( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \Bigr).
\]

In (Crisp and Burges, 2000; Chang and Lin, 2001), it has been shown that $e^T \alpha \ge \nu$ can be replaced by $e^T \alpha = \nu$. With this property, in LIBSVM we solve a scaled version of (2.3):

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le 1, \; i = 1, \ldots, l, \; e^T \alpha = \nu l, \; y^T \alpha = 0.
\]

We output $\alpha/\rho$, so the computed decision function is

\[
\operatorname{sgn}\Bigl( \sum_{i=1}^{l} y_i (\alpha_i/\rho) K(x_i, x) + b \Bigr),
\]

and the two margins are $y_i (w^T \phi(x_i) + b) = \pm 1$, which are the same as those of C-SVC.
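Both C-SVC and ν-SVC therefore produce a decision function of the same form, $\operatorname{sgn}(\sum_i y_i \alpha_i K(x_i, x) + b)$. As a minimal sketch (not LIBSVM's API), it can be evaluated as follows once the dual variables and the offset have been obtained by solving the duals above; the RBF kernel and the names `rbf_kernel`, `c_svc_decision`, and `gamma` are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2), one of the kernels LIBSVM supports."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def c_svc_decision(x, support_vectors, y, alpha, b, gamma=0.5):
    """Evaluate sgn(sum_i y_i * alpha_i * K(x_i, x) + b) for a test point x."""
    s = sum(y[i] * alpha[i] * rbf_kernel(sv, x, gamma)
            for i, sv in enumerate(support_vectors))
    return np.sign(s + b)
```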
2.3 Distribution Estimation (One-class SVM)

One-class SVM was proposed by Schölkopf et al. (2001) for estimating the support of a high-dimensional distribution. Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, without any class information, the primal form in (Schölkopf et al., 2001) is

\[
\min_{w, \xi, \rho} \quad \frac{1}{2} w^T w - \rho + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad w^T \phi(x_i) \ge \rho - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l.
\]

The dual is

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le 1/(\nu l), \; i = 1, \ldots, l, \; e^T \alpha = 1,
\tag{2.4}
\]

where $Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$. In LIBSVM we solve a scaled version of (2.4):

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le 1, \; i = 1, \ldots, l, \; e^T \alpha = \nu l.
\]

The decision function is

\[
\operatorname{sgn}\Bigl( \sum_{i=1}^{l} \alpha_i K(x_i, x) - \rho \Bigr).
\]

2.4 ε-Support Vector Regression (ε-SVR)

Given a set of data points $\{(x_1, z_1), \ldots, (x_l, z_l)\}$ such that $x_i \in R^n$ is an input and $z_i \in R^1$ is a target output, the standard form of support vector regression (Vapnik, 1998) is

\[
\min_{w, b, \xi, \xi^*} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^*
\quad \text{subject to} \quad
w^T \phi(x_i) + b - z_i \le \epsilon + \xi_i, \;
z_i - w^T \phi(x_i) - b \le \epsilon + \xi_i^*, \;
\xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l.
\]

The dual is

\[
\min_{\alpha, \alpha^*} \quad \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \epsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} z_i (\alpha_i - \alpha_i^*)
\quad \text{subject to} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \; 0 \le \alpha_i, \alpha_i^* \le C, \; i = 1, \ldots, l,
\tag{2.5}
\]

where $Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$. The approximate function is

\[
\sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b.
\]

2.5 ν-Support Vector Regression (ν-SVR)

Similar to ν-SVC, for regression Schölkopf et al. (2000) use a parameter ν to control the number of support vectors. However, unlike ν-SVC, where ν replaces $C$, here ν replaces the parameter ε of ε-SVR. The primal form is

\[
\min_{w, b, \xi, \xi^*, \epsilon} \quad \frac{1}{2} w^T w + C\Bigl( \nu\epsilon + \frac{1}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*) \Bigr)
\quad \text{subject to} \quad
(w^T \phi(x_i) + b) - z_i \le \epsilon + \xi_i, \;
z_i - (w^T \phi(x_i) + b) \le \epsilon + \xi_i^*, \;
\xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l, \; \epsilon \ge 0,
\tag{2.6}
\]

and the dual is

\[
\min_{\alpha, \alpha^*} \quad \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + z^T (\alpha - \alpha^*)
\quad \text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \; e^T (\alpha + \alpha^*) \le C\nu, \; 0 \le \alpha_i, \alpha_i^* \le C/l, \; i = 1, \ldots, l.
\tag{2.7}
\]

Similarly, the inequality $e^T (\alpha + \alpha^*) \le C\nu$ can be replaced by an equality. In LIBSVM, we consider $C \leftarrow C/l$, so the dual problem actually solved is

\[
\min_{\alpha, \alpha^*} \quad \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + z^T (\alpha - \alpha^*)
\quad \text{subject to} \quad e^T (\alpha - \alpha^*) = 0, \; e^T (\alpha + \alpha^*) = Cl\nu, \; 0 \le \alpha_i, \alpha_i^* \le C, \; i = 1, \ldots, l.
\tag{2.8}
\]

The decision function is

\[
\sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b,
\]

which is the same as that of ε-SVR.

3 Solving the Quadratic Problems

3.1 The Decomposition Method for C-SVC, ε-SVR, and One-class SVM

We consider the following general form of C-SVC, ε-SVR, and one-class SVM:

\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha + p^T \alpha
\quad \text{subject to} \quad y^T \alpha = \Delta, \; 0 \le \alpha_t \le C, \; t = 1, \ldots, l,
\tag{3.1}
\]

where $y_t = \pm 1$, $t = 1, \ldots, l$. It can be clearly seen that C-SVC and one-class SVM are already in the form of (3.1). For ε-SVR, we consider the following reformulation of (2.5):

\[
\min_{\alpha, \alpha^*} \quad \frac{1}{2}
\bigl[ \alpha^T, (\alpha^*)^T \bigr]
\begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix}
\begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}
+ \bigl[ \epsilon e^T + z^T, \; \epsilon e^T - z^T \bigr]
\begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}
\quad \text{subject to} \quad y^T \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} = 0, \; 0 \le \alpha_t, \alpha_t^* \le C, \; t = 1, \ldots, l,
\tag{3.2}
\]

where $y$ is a $2l \times 1$ vector with $y_t = 1$, $t = 1, \ldots, l$ and $y_t = -1$, $t = l+1, \ldots, 2l$ (a small code sketch of this construction is given at the end of this subsection).

The difficulty of solving (3.1) is the density of $Q$, because $Q_{ij}$ is in general not zero. In LIBSVM, we use the decomposition method to conquer this difficulty. Some work on this method includes, for example, (Osuna et al., 1997a; Joachims, 1998; Platt, 1998). The method modifies only a subset of $\alpha$ per iteration. This subset, denoted as the working set $B$, leads to a small sub-problem to be minimized in each iteration. An extreme case is Sequential Minimal Optimization (SMO) (Platt, 1998), which restricts $B$ to have only two elements; then in each iteration one solves a simple two-variable problem without needing optimization software.
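Before turning to the specific method used in LIBSVM, here is a minimal sketch of the reformulation (3.2) mentioned above: it stacks $\alpha$ and $\alpha^*$ into a single $2l$-vector so that ε-SVR fits the general form (3.1). The function name and the dense NumPy representation are illustrative assumptions, not LIBSVM's data structures.

```python
import numpy as np

def epsilon_svr_as_general_form(K, z, C, eps):
    """Rewrite the epsilon-SVR dual (2.5) in the general form (3.1), following (3.2).

    K: (l, l) kernel matrix, z: (l,) targets, C: penalty, eps: epsilon.
    Returns (Q_tilde, p, y, C) describing
        min 0.5 a'Q_tilde a + p'a  s.t.  y'a = 0, 0 <= a_t <= C,
    where a stacks [alpha; alpha*].
    """
    l = len(z)
    Q_tilde = np.block([[K, -K], [-K, K]])           # the 2l-by-2l matrix of (3.2)
    p = np.concatenate([eps + z, eps - z])           # [eps*e + z ; eps*e - z]
    y = np.concatenate([np.ones(l), -np.ones(l)])    # y_t = +1 for t <= l, -1 otherwise
    return Q_tilde, p, y, C
```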
Here we consider an SMO-type decomposition method proposed in Fan et al. (2005).

Algorithm 1 (An SMO-type decomposition method in Fan et al. (2005))

1. Find $\alpha^1$ as the initial feasible solution. Set $k = 1$.

2. If $\alpha^k$ is a stationary point of (2.2), stop. Otherwise, find a two-element working set $B = \{i, j\}$ by WSS 1 (described in Section 3.2). Define $N \equiv \{1, \ldots, l\} \setminus B$, and let $\alpha^k_B$ and $\alpha^k_N$ be the sub-vectors of $\alpha^k$ corresponding to $B$ and $N$, respectively.

3. If $a_{ij} \equiv K_{ii} + K_{jj} - 2K_{ij} > 0$, solve the following sub-problem with the variables $\alpha_B$:

\[
\min_{\alpha_i, \alpha_j} \quad \frac{1}{2}
\begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ (p_B + Q_{BN}\alpha^k_N)^T
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
\quad \text{subject to} \quad 0 \le \alpha_i, \alpha_j \le C, \; y_i\alpha_i + y_j\alpha_j = \Delta - y_N^T\alpha^k_N.
\tag{3.3}
\]

   Otherwise, solve

\[
\min_{\alpha_i, \alpha_j} \quad \frac{1}{2}
\begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ (p_B + Q_{BN}\alpha^k_N)^T
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ \frac{\tau - a_{ij}}{4}\bigl( (\alpha_i - \alpha^k_i)^2 + (\alpha_j - \alpha^k_j)^2 \bigr)
\tag{3.4}
\]

   subject to the constraints of (3.3).

4. Set $\alpha^{k+1}_B$ to be the optimal solution of (3.3) or (3.4), and $\alpha^{k+1}_N \equiv \alpha^k_N$. Set $k \leftarrow k + 1$ and go to Step 2.

Note that $B$ is updated at each iteration. To simplify the notation, we simply use $B$ instead of $B^k$. If $a_{ij} \le 0$, (3.3) is a concave problem, so we use the convex modification (3.4).

3.2 Stopping Criteria and Working Set Selection for C-SVC, ε-SVR, and One-class SVM

The Karush-Kuhn-Tucker (KKT) optimality condition of (3.1) shows that a vector $\alpha$ is a stationary point of (3.1) if and only if there is a number $b$ and two nonnegative vectors $\lambda$ and $\mu$ such that

\[
\nabla f(\alpha) + by = \lambda - \mu, \quad
\lambda_i\alpha_i = 0, \quad \mu_i(C - \alpha_i) = 0, \quad \lambda_i \ge 0, \quad \mu_i \ge 0, \quad i = 1, \ldots, l,
\]

where $\nabla f(\alpha) \equiv Q\alpha + p$ is the gradient of $f(\alpha)$. This condition can be rewritten as

\[
\nabla f(\alpha)_i + by_i \ge 0 \quad \text{if } \alpha_i < C,
\tag{3.5}
\]
\[
\nabla f(\alpha)_i + by_i \le 0 \quad \text{if } \alpha_i > 0.
\tag{3.6}
\]

Since $y_i = \pm 1$, by defining

\[
I_{up}(\alpha) \equiv \{ t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1 \}, \quad
I_{low}(\alpha) \equiv \{ t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1 \},
\tag{3.7}
\]

a feasible $\alpha$ is a stationary point of (3.1) if and only if

\[
m(\alpha) \le M(\alpha),
\tag{3.8}
\]

where

\[
m(\alpha) \equiv \max_{i \in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i, \quad
M(\alpha) \equiv \min_{i \in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i.
\]

From this we have the following stopping condition:

\[
m(\alpha^k) - M(\alpha^k) \le \epsilon.
\tag{3.9}
\]

For the selection of the working set $B$, we consider the following procedure:

WSS 1

1. For all $t, s$, define

\[
a_{ts} \equiv K_{tt} + K_{ss} - 2K_{ts}, \quad
b_{ts} \equiv -y_t\nabla f(\alpha^k)_t + y_s\nabla f(\alpha^k)_s > 0,
\tag{3.10}
\]
and
\[
\bar{a}_{ts} \equiv
\begin{cases}
a_{ts} & \text{if } a_{ts} > 0, \\
\tau & \text{otherwise}.
\end{cases}
\tag{3.11}
\]

   Select

\[
i \in \arg\max_t \bigl\{ -y_t\nabla f(\alpha^k)_t \mid t \in I_{up}(\alpha^k) \bigr\}, \quad
j \in \arg\min_t \Bigl\{ -\frac{b_{it}^2}{\bar{a}_{it}} \Bigm| t \in I_{low}(\alpha^k), \; -y_t\nabla f(\alpha^k)_t < -y_i\nabla f(\alpha^k)_i \Bigr\}.
\tag{3.12}
\]

2. Return $B = \{i, j\}$.

Details of how we choose this working set are in (Fan et al., 2005, Section II).

3.3 Convergence of the Decomposition Method

See (Fan et al., 2005, Section III) or (Chen et al., 2006) for a detailed discussion of the convergence of Algorithm 1.

3.4 The Decomposition Method for ν-SVC and ν-SVR

Both ν-SVC and ν-SVR can be considered as the following general form:

\[
\min_{\alpha} \quad \frac{1}{2}\alpha^T Q\alpha + p^T\alpha
\quad \text{subject to} \quad y^T\alpha = \Delta_1, \; e^T\alpha = \Delta_2, \; 0 \le \alpha_t \le C, \; t = 1, \ldots, l.
\tag{3.13}
\]

The KKT condition of (3.13) shows

\[
\nabla f(\alpha)_i - \rho + by_i
\begin{cases}
= 0 & \text{if } 0 < \alpha_i < C, \\
\ge 0 & \text{if } \alpha_i = 0, \\
\le 0 & \text{if } \alpha_i = C.
\end{cases}
\]

Define $r_1 \equiv \rho - b$ and $r_2 \equiv \rho + b$. If $y_i = 1$, the KKT condition becomes

\[
\nabla f(\alpha)_i - r_1
\begin{cases}
\ge 0 & \text{if } \alpha_i < C, \\
\le 0 & \text{if } \alpha_i > 0.
\end{cases}
\tag{3.14}
\]

On the other hand, if $y_i = -1$, it is

\[
\nabla f(\alpha)_i - r_2
\begin{cases}
\ge 0 & \text{if } \alpha_i < C, \\
\le 0 & \text{if } \alpha_i > 0.
\end{cases}
\tag{3.15}
\]

Hence, given a tolerance $\epsilon > 0$, the stopping condition is

\[
\max\bigl( m_p(\alpha) - M_p(\alpha), \; m_n(\alpha) - M_n(\alpha) \bigr) < \epsilon,
\tag{3.16}
\]

where

\[
m_p(\alpha) \equiv \max_{i \in I_{up}(\alpha),\, y_i = 1} -y_i\nabla f(\alpha)_i, \quad
M_p(\alpha) \equiv \min_{i \in I_{low}(\alpha),\, y_i = 1} -y_i\nabla f(\alpha)_i,
\]
and
\[
m_n(\alpha) \equiv \max_{i \in I_{up}(\alpha),\, y_i = -1} -y_i\nabla f(\alpha)_i, \quad
M_n(\alpha) \equiv \min_{i \in I_{low}(\alpha),\, y_i = -1} -y_i\nabla f(\alpha)_i.
\]
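Both stopping conditions amount to comparing maxima and minima of $-y_i\nabla f(\alpha)_i$ over suitable index sets. As a minimal sketch (vectorized NumPy, not LIBSVM's code), the generic stopping check (3.9) and the WSS 1 selection of Section 3.2 can be written as follows; for the ν-formulations the same computation is simply carried out separately over the indices with $y_i = 1$ and $y_i = -1$. The function name `wss1_select` and the value of `TAU` are illustrative assumptions.

```python
import numpy as np

TAU = 1e-12  # small positive constant used when a_ts <= 0, cf. (3.11)

def wss1_select(grad, alpha, y, Q, C, eps=1e-3):
    """Stopping check (3.9) and working set selection WSS 1.

    grad is the current gradient Q @ alpha + p, y holds +1/-1 labels,
    Q is the matrix with Q_ij = y_i * y_j * K(x_i, x_j).
    Returns (stop, i, j); i and j are -1 when the stopping condition holds.
    """
    minus_yG = -y * grad
    up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))    # I_up of (3.7)
    low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))   # I_low of (3.7)

    i = np.where(up)[0][np.argmax(minus_yG[up])]
    m = minus_yG[i]                                              # m(alpha)
    M = minus_yG[low].min()                                      # M(alpha)
    if m - M <= eps:                                             # stopping condition (3.9)
        return True, -1, -1

    # candidates for j: t in I_low with -y_t * grad_t < m, cf. (3.12)
    cand = np.where(low & (minus_yG < m))[0]
    Qdiag = np.diag(Q)
    a = Qdiag[i] + Qdiag[cand] - 2.0 * y[i] * y[cand] * Q[i, cand]   # a_it of (3.10)
    a_bar = np.where(a > 0, a, TAU)                                  # (3.11)
    b = m - minus_yG[cand]                                           # b_it of (3.10), > 0
    j = cand[np.argmin(-(b * b) / a_bar)]
    return False, int(i), int(j)
```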
The working set selection is done by extending WSS 1 to the following:

WSS 2 (Extending WSS 1 to ν-SVM)

1. Find $i_p \in \arg m_p(\alpha^k)$ and
\[
j_p \in \arg\min_t \Bigl\{ -\frac{b_{i_p t}^2}{\bar{a}_{i_p t}} \Bigm| y_t = 1, \; t \in I_{low}(\alpha^k), \; -y_t\nabla f(\alpha^k)_t < -y_{i_p}\nabla f(\alpha^k)_{i_p} \Bigr\}.
\]

2. Find $i_n \in \arg m_n(\alpha^k)$ and
\[
j_n \in \arg\min_t \Bigl\{ -\frac{b_{i_n t}^2}{\bar{a}_{i_n t}} \Bigm| y_t = -1, \; t \in I_{low}(\alpha^k), \; -y_t\nabla f(\alpha^k)_t < -y_{i_n}\nabla f(\alpha^k)_{i_n} \Bigr\}.
\]

3. Return $\{i_p, j_p\}$ or $\{i_n, j_n\}$, depending on which one gives a smaller $-b_{ij}^2/\bar{a}_{ij}$.

3.5 Analytical Solutions

Details are described in Section 5, in which we discuss the solution of a more general sub-problem.

3.6 The Calculation of b or ρ

After the solution $\alpha$ of the dual optimization problem is obtained, the variables $b$ or $\rho$ must be calculated, as they are used in the decision function. Here we simply describe the case of ν-SVC and ν-SVR, where $b$ and $\rho$ both appear; other formulations are simplified cases of them. The KKT condition of (3.13) has been shown in (3.14) and (3.15). Now we consider the case of $y_i = 1$. If there are $\alpha_i$ which satisfy $0 < \alpha_i < C$, then $r_1 = \nabla f(\alpha)_i$. Practically, to avoid numerical errors, we average them:

\[
r_1 = \frac{\sum_{0 < \alpha_i < C,\, y_i = 1} \nabla f(\alpha)_i}{\bigl| \{ i \mid 0 < \alpha_i < C, \; y_i = 1 \} \bigr|}.
\]

The value $r_2$ is obtained in the same way from the indices with $y_i = -1$; then, from $r_1 \equiv \rho - b$ and $r_2 \equiv \rho + b$, we recover $\rho = (r_1 + r_2)/2$ and $b = (r_2 - r_1)/2$.

4 Shrinking and Caching

4.1 Shrinking

Since for many problems the number of free support vectors (i.e., $0 < \alpha_i < C$) is small, the shrinking technique reduces the size of the working problem without considering some bounded variables (Joachims, 1998). The following theorem supports this technique.

Theorem 4.1

1. For an optimal solution $\bar{\alpha}$ of (2.2), define
\[
I \equiv \{ i \mid -y_i\nabla f(\bar{\alpha})_i > M(\bar{\alpha}) \text{ or } -y_i\nabla f(\bar{\alpha})_i < m(\bar{\alpha}) \}.
\tag{4.1}
\]
Problem (2.2) has unique and bounded optimal solutions at $\alpha_i$, $i \in I$.

2. Assume Algorithm 1 generates an infinite sequence $\{\alpha^k\}$. There is $\bar{k}$ such that after $k \ge \bar{k}$, every $\alpha^k_i$, $i \in I$, has reached the unique and bounded optimal solution. It remains the same in all subsequent iterations, and for all $k \ge \bar{k}$,
\[
i \notin \{ t \mid M(\alpha^k) \le -y_t\nabla f(\alpha^k)_t \le m(\alpha^k) \}.
\tag{4.2}
\]

Hence, instead of solving the whole problem (2.2), the decomposition method works on a smaller problem:

\[
\min_{\alpha_A} \quad \frac{1}{2}\alpha_A^T Q_{AA}\alpha_A + (p_A + Q_{AN}\alpha^k_N)^T\alpha_A
\quad \text{subject to} \quad 0 \le (\alpha_A)_t \le C, \; t = 1, \ldots, q, \; y_A^T\alpha_A = \Delta - y_N^T\alpha^k_N,
\tag{4.3}
\]

where $A$ is the set of activated variables (with $q = |A|$) and $N = \{1, \ldots, l\} \setminus A$ is the set of shrunken variables.

Of course this heuristic may fail if the optimal solution of (4.3) is not the corresponding part of that of (2.2). When that happens, the whole problem (2.2) is reoptimized starting from a point $\alpha$ where $\alpha_A$ is an optimal solution of (4.3) and $\alpha_N$ consists of the bounded variables identified before the shrinking process.

Note that while solving the shrunken problem (4.3), we only know the gradient $Q_{AA}\alpha_A + Q_{AN}\alpha_N + p_A$ of (4.3). Hence, when problem (2.2) is reoptimized, we also have to reconstruct the whole gradient $\nabla f(\alpha)$, which is quite expensive.

Many implementations begin the shrinking procedure near the end of the iterative process; in LIBSVM, however, we start shrinking from the beginning. The procedure is as follows:

1. After every $\min(l, 1000)$ iterations, we try to shrink some variables. Note that during the iterative process

\[
m(\alpha^k) > M(\alpha^k)
\tag{4.4}
\]

   as (3.9) is not satisfied yet. Following Theorem 4.1, we conjecture that variables in the following set can be shrunken:

\[
\{ t \mid -y_t\nabla f(\alpha^k)_t > m(\alpha^k), \; t \in I_{low}(\alpha^k), \; \alpha^k_t \text{ is bounded} \} \cup
\{ t \mid -y_t\nabla f(\alpha^k)_t < M(\alpha^k), \; t \in I_{up}(\alpha^k), \; \alpha^k_t \text{ is bounded} \}
\]
\[
= \{ t \mid -y_t\nabla f(\alpha^k)_t > m(\alpha^k), \; \alpha^k_t = C, y_t = 1 \text{ or } \alpha^k_t = 0, y_t = -1 \} \cup
\{ t \mid -y_t\nabla f(\alpha^k)_t < M(\alpha^k), \; \alpha^k_t = 0, y_t = 1 \text{ or } \alpha^k_t = C, y_t = -1 \}.
\tag{4.5}
\]

   Thus the set $A$ of activated variables is dynamically reduced every $\min(l, 1000)$ iterations.

2. The above shrinking strategy may of course be too aggressive. Since the decomposition method has very slow convergence and a large portion of iterations is spent on achieving the final digit of the required accuracy, we do not want those iterations to be wasted because of a wrongly shrunken problem (4.3). Hence, when the decomposition method first achieves the tolerance

\[
m(\alpha^k) \le M(\alpha^k) + 10\epsilon,
\]

   where $\epsilon$ is the specified stopping tolerance, we reconstruct the whole gradient. Then we inactivate some variables based on the current set (4.5), and the decomposition method continues.
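As a minimal sketch of the shrinking test (4.5) (not LIBSVM's bookkeeping), the candidate set can be computed as follows; exact comparison of $\alpha_t$ with the bounds $0$ and $C$ is assumed for simplicity, and the function name is illustrative.

```python
import numpy as np

def shrink_candidates(grad, alpha, y, C, m, M):
    """Variables that may be shrunken according to (4.5).

    m and M are m(alpha^k) and M(alpha^k); a variable is a candidate only if it
    sits at a bound (0 or C) and its gradient shows it cannot move toward optimality.
    """
    minus_yG = -y * grad
    at_upper_pos = (alpha == C) & (y == 1)
    at_lower_neg = (alpha == 0) & (y == -1)
    at_lower_pos = (alpha == 0) & (y == 1)
    at_upper_neg = (alpha == C) & (y == -1)
    shrink = ((minus_yG > m) & (at_upper_pos | at_lower_neg)) | \
             ((minus_yG < M) & (at_lower_pos | at_upper_neg))
    return np.where(shrink)[0]
```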
Therefore, in LIBSVM, the size of the set $A$ of (4.3) is dynamically reduced. To decrease the cost of reconstructing the gradient $\nabla f(\alpha)$, during the iterations we always keep

\[
\bar{G}_i = C \sum_{\alpha_j = C} Q_{ij}, \quad i = 1, \ldots, l.
\]

Then for the gradient $\nabla f(\alpha)_i$, $i \notin A$, we have

\[
\nabla f(\alpha)_i = \sum_{j=1}^{l} Q_{ij}\alpha_j + p_i = \bar{G}_i + \sum_{0 < \alpha_j < C} Q_{ij}\alpha_j + p_i.
\]

For ν-SVC and ν-SVR, as the stopping condition (3.16) treats the two classes separately, shrinking is also done separately for each class. For $y_t = 1$, we shrink the following set:

\[
\{ t \mid -y_t\nabla f(\alpha^k)_t > m_p(\alpha^k), \; \alpha_t = C, \; y_t = 1 \} \cup
\{ t \mid -y_t\nabla f(\alpha^k)_t < M_p(\alpha^k), \; \alpha_t = 0, \; y_t = 1 \}.
\]

For $y_t = -1$, we shrink the following set:

\[
\{ t \mid -y_t\nabla f(\alpha^k)_t > m_n(\alpha^k), \; \alpha_t = 0, \; y_t = -1 \} \cup
\{ t \mid -y_t\nabla f(\alpha^k)_t < M_n(\alpha^k), \; \alpha_t = C, \; y_t = -1 \}.
\]

4.2 Caching

Another technique for reducing the computational time is caching. Since $Q$ is fully dense and may not fit in the computer memory, elements $Q_{ij}$ are calculated as needed. Usually a special storage using the idea of a cache is used to store recently used $Q_{ij}$ (Joachims, 1998), so the computational cost of later iterations can be reduced.

Theorem 4.1 also supports the use of the cache: in the final iterations, only some columns of the matrix $Q$ are still needed. Thus, if the cache can contain these columns, we can avoid most kernel evaluations in the final iterations.

In LIBSVM, we implement a simple least-recently-used strategy for the cache. We dynamically cache only recently used columns of $Q_{AA}$ of (4.3).

4.3 Computational Complexity

The discussion in Section 3.3 is about the asymptotic convergence of the decomposition method. Here, we discuss the computational complexity.

The main operations are finding $Q_{BN}\alpha^k_N + p_B$ of (3.3) and updating $\nabla f(\alpha^k)$ to $\nabla f(\alpha^{k+1})$. Note that $\nabla f(\alpha)$ is used in the working set selection as well as in the stopping condition. The two operations can be considered together as

\[
Q_{BN}\alpha^k_N + p_B = \nabla f(\alpha^k)_B - Q_{BB}\alpha^k_B,
\tag{4.6}
\]
and
\[
\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + Q_{:,B}(\alpha^{k+1}_B - \alpha^k_B),
\tag{4.7}
\]

where $Q_{:,B}$ is the sub-matrix of $Q$ with columns in $B$. That is, at the $k$th iteration, as we already have $\nabla f(\alpha^k)$, the right-hand side of (4.6) is used to construct the sub-problem. After the sub-problem is solved, (4.7) is employed to obtain the next gradient $\nabla f(\alpha^{k+1})$.

As $B$ has only two elements and solving the sub-problem is easy, the main cost is $Q_{:,B}(\alpha^{k+1}_B - \alpha^k_B)$ of (4.7). The operation itself takes $O(2l)$, but if $Q_{:,B}$ is not available in the cache and each kernel evaluation costs $O(n)$, constructing one column of $Q_{:,B}$ already needs $O(ln)$. Therefore, the complexity is:

1. #Iterations $\times O(l)$ if most columns of $Q$ are cached during iterations.

2. #Iterations $\times O(nl)$ if most columns of $Q$ are not cached during iterations and each kernel evaluation is $O(n)$.

Note that if shrinking is incorporated, $l$ will gradually decrease during the iterations.

5 Unbalanced Data

For some classification problems, the numbers of data in different classes are unbalanced. Hence some researchers (e.g., Osuna et al., 1997b) have proposed using different penalty parameters in the SVM formulation. For example, C-SVM becomes

\[
\min_{w, b, \xi} \quad \frac{1}{2}w^T w + C_{+}\sum_{y_i = 1}\xi_i + C_{-}\sum_{y_i = -1}\xi_i
\quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l.
\]

Its dual is

\[
\min_{\alpha} \quad \frac{1}{2}\alpha^T Q\alpha - e^T\alpha
\quad \text{subject to} \quad
\]
\[
0 \le \alpha_i \le C_{+} \quad \text{if } y_i = 1,
\tag{5.1}
\]
\[
0 \le \alpha_i \le C_{-} \quad \text{if } y_i = -1,
\tag{5.2}
\]
\[
y^T\alpha = 0.
\]

Note that by replacing $C$ with different $C_i$, $i = 1, \ldots, l$, most of the earlier analysis is still correct; using $C_{+}$ and $C_{-}$ is just a special case of this. Therefore, the implementation is almost the same. A main difference is in the solution of the sub-problem (3.3).
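On the constraint side, the only change is that each variable receives its own upper bound. A minimal sketch (the names `C_pos` and `C_neg` are illustrative assumptions):

```python
import numpy as np

def upper_bounds(y, C_pos, C_neg):
    """Per-variable upper bounds of (5.1)-(5.2): C+ when y_i = 1, C- when y_i = -1."""
    return np.where(y == 1, C_pos, C_neg)
```

These bounds then replace the single $C$ in the box constraints of the sub-problem below; in the LIBSVM tools this corresponds to the per-class weight parameters.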
With per-variable upper bounds, the sub-problem (3.3) becomes:

\[
\min_{\alpha_i, \alpha_j} \quad \frac{1}{2}
\begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ (Q_{i,N}\alpha_N - 1)\alpha_i + (Q_{j,N}\alpha_N - 1)\alpha_j
\quad \text{subject to} \quad
y_i\alpha_i + y_j\alpha_j = \Delta - y_N^T\alpha^k_N, \; 0 \le \alpha_i \le C_i, \; 0 \le \alpha_j \le C_j,
\tag{5.3}
\]

where $C_i$ and $C_j$ can be $C_{+}$ or $C_{-}$ depending on $y_i$ and $y_j$.

Let $\alpha_i = \alpha^k_i + d_i$, $\alpha_j = \alpha^k_j + d_j$, and $\hat{d}_i \equiv y_i d_i$, $\hat{d}_j \equiv y_j d_j$. Then (5.3) can be written as

\[
\min_{d_i, d_j} \quad \frac{1}{2}
\begin{bmatrix} d_i & d_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} d_i \\ d_j \end{bmatrix}
+ \begin{bmatrix} \nabla f(\alpha^k)_i & \nabla f(\alpha^k)_j \end{bmatrix}
\begin{bmatrix} d_i \\ d_j \end{bmatrix}
\quad \text{subject to} \quad
y_i d_i + y_j d_j = 0, \; -\alpha^k_i \le d_i \le C_i - \alpha^k_i, \; -\alpha^k_j \le d_j \le C_j - \alpha^k_j.
\tag{5.4}
\]

Define $a_{ij}$ and $b_{ij}$ as in (3.10); if $a_{ij} \le 0$, a modification similar to (3.4) is applied. Using $\hat{d}_i = -\hat{d}_j$, the objective function can be written as

\[
\frac{1}{2}\bar{a}_{ij}\hat{d}_j^2 + b_{ij}\hat{d}_j.
\]

Thus,

\[
\alpha^{new}_i = \alpha^k_i + y_i b_{ij}/\bar{a}_{ij}, \quad
\alpha^{new}_j = \alpha^k_j - y_j b_{ij}/\bar{a}_{ij}.
\tag{5.5}
\]

To modify them back to the feasible region, we first consider the case $y_i \neq y_j$ and ...
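As a minimal sketch (not LIBSVM's exact code path), the update (5.5) can be combined with a simple projection back onto the feasible region of (5.4): since $\hat{d}_i = -\hat{d}_j$, both variables move along a line on which the one-dimensional objective is convex, so clipping the common step to the interval allowed by the two box constraints yields the constrained minimizer. The function name, the NumPy usage, and the value of `tau` are illustrative assumptions.

```python
import numpy as np

def update_pair(i, j, alpha, grad, y, K, C, tau=1e-12):
    """Unconstrained update (5.5) for the pair (i, j), then projection to the box.

    K is the kernel matrix, C is an array of per-variable upper bounds (C_i of
    Section 5), grad is the gradient at alpha^k.  The projection keeps
    y_i*alpha_i + y_j*alpha_j fixed and clips the common step so that both
    variables stay in [0, C_i] and [0, C_j].
    """
    a = K[i, i] + K[j, j] - 2.0 * K[i, j]
    a_bar = a if a > 0 else tau                       # cf. (3.11) / (3.4)
    b = -y[i] * grad[i] + y[j] * grad[j]              # b_ij of (3.10)
    t = b / a_bar                                     # unconstrained step: alpha_i += y_i*t, alpha_j -= y_j*t

    # feasible interval for t from 0 <= alpha_i + y_i*t <= C_i
    lo_i, hi_i = ((-alpha[i], C[i] - alpha[i]) if y[i] > 0
                  else (alpha[i] - C[i], alpha[i]))
    # and from 0 <= alpha_j - y_j*t <= C_j
    lo_j, hi_j = ((alpha[j] - C[j], alpha[j]) if y[j] > 0
                  else (-alpha[j], C[j] - alpha[j]))
    t = np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j))

    return alpha[i] + y[i] * t, alpha[j] - y[j] * t
```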