LIBSVM: a Library for Support Vector Machines
Chih-Chung Chang and Chih-Jen Lin∗
Last updated: December 27, 2006
Abstract
LIBSVM is a library for support vector machines (SVM). Its goal is to help
users easily apply SVM as a tool. In this document, we present all its implementation
details. For the usage of LIBSVM, the README file included in the
package and the LIBSVM FAQ provide the necessary information.
1 Introduction
LIBSVM is a library for support vector machines (SVM). Its goal is to let users
easily apply SVM as a tool. In this document, we present all its implementation details.
For the usage of LIBSVM, the README file included in the package provides the necessary information.
In Section 2, we show the formulations used in LIBSVM: C-support vector classification
(C-SVC), ν-support vector classification (ν-SVC), distribution estimation (one-class
SVM), ε-support vector regression (ε-SVR), and ν-support vector regression
(ν-SVR). We discuss the implementation of solving the quadratic problems in Section 3.
Section 4 describes two implementation techniques: shrinking and caching. We also
support different penalty parameters for unbalanced data; details are in Section 5.
Section 6 then discusses the implementation of multi-class classification. Parameter
selection is important for obtaining good SVM models, and LIBSVM provides simple and
useful tools for it, which are discussed in Section 7. Section 8 presents the implementation
of probability outputs.
2 Formulations
2.1 C-Support Vector Classification
Given training vectors xi ∈ Rn, i = 1, . . . , l, in two classes, and a vector y ∈ Rl
such that yi ∈ {1,−1}, C-SVC (Cortes and Vapnik, 1995; Vapnik, 1998) solves the
∗Department of Computer Science, National Taiwan University, Taipei 106, Taiwan (http://www.csie.ntu.edu.tw/~cjlin). E-mail: cjlin@csie.ntu.edu.tw
following primal problem:
    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i                        (2.1)
    subject to   y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, . . . , l.
Its dual is
    min_α  (1/2) α^T Q α − e^T α
    subject to  y^T α = 0,                                              (2.2)
                0 ≤ α_i ≤ C, i = 1, . . . , l,
where e is the vector of all ones, C > 0 is the upper bound, Q is an l by l positive
semidefinite matrix, Qij ≡ yiyjK(xi,xj), and K(xi,xj) ≡ φ(xi)Tφ(xj) is the kernel.
Here training vectors xi are mapped into a higher (maybe infinite) dimensional space
by the function φ.
The decision function is
    sgn( Σ_{i=1}^{l} y_i α_i K(x_i, x) + b ).
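As an illustration, the decision function above can be evaluated directly from the support vectors, their coefficients, and b. The sketch below uses an RBF kernel and toy values; the function names and the pure-Python style are ours for exposition, not LIBSVM's C++ API.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), a common kernel choice
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svc_decide(x, support_vectors, y, alpha, b, kernel=rbf_kernel):
    # sgn( sum_i y_i * alpha_i * K(x_i, x) + b )
    s = sum(yi * ai * kernel(xi, x)
            for xi, yi, ai in zip(support_vectors, y, alpha)) + b
    return 1 if s >= 0 else -1
```

With two toy support vectors of opposite labels and equal weights, a query point is assigned to the class of the nearer support vector, as expected for an RBF kernel.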
2.2 ν-Support Vector Classification
The ν-support vector classification (Schölkopf et al., 2000) uses a new parameter ν
which controls the number of support vectors and training errors. The parameter
ν ∈ (0, 1] is an upper bound on the fraction of training errors and a lower bound of
the fraction of support vectors.
Given training vectors xi ∈ Rn, i = 1, . . . , l, in two classes, and a vector y ∈ Rl
such that yi ∈ {1,−1}, the primal form considered is:
    min_{w,b,ξ,ρ}  (1/2) w^T w − νρ + (1/l) Σ_{i=1}^{l} ξ_i
    subject to     y_i (w^T φ(x_i) + b) ≥ ρ − ξ_i,
                   ξ_i ≥ 0, i = 1, . . . , l, ρ ≥ 0.
The dual is:
    min_α  (1/2) α^T Q α
    subject to  0 ≤ α_i ≤ 1/l, i = 1, . . . , l,                        (2.3)
                e^T α ≥ ν, y^T α = 0,
where Qij ≡ yiyjK(xi,xj).
The decision function is:
    sgn( Σ_{i=1}^{l} y_i α_i K(x_i, x) + b ).
In (Crisp and Burges, 2000; Chang and Lin, 2001), it has been shown that eTα ≥ ν
can be replaced by eTα = ν. With this property, in LIBSVM, we solve a scaled version
of (2.3):
    min_α  (1/2) α^T Q α
    subject to  0 ≤ α_i ≤ 1, i = 1, . . . , l,
                e^T α = νl,
                y^T α = 0.
We output α/ρ, so the computed decision function is:
    sgn( Σ_{i=1}^{l} y_i (α_i/ρ) K(x_i, x) + b )
and then the two margins
    y_i (w^T φ(x_i) + b) = ±1
are the same as those of C-SVC.
2.3 Distribution Estimation (One-class SVM)
One-class SVM was proposed by Schölkopf et al. (2001) for estimating the support of
a high-dimensional distribution. Given training vectors x_i ∈ R^n, i = 1, . . . , l without
any class information, the primal form in (Schölkopf et al., 2001) is:
    min_{w,ξ,ρ}  (1/2) w^T w − ρ + (1/(νl)) Σ_{i=1}^{l} ξ_i
    subject to   w^T φ(x_i) ≥ ρ − ξ_i,
                 ξ_i ≥ 0, i = 1, . . . , l.
The dual is:
    min_α  (1/2) α^T Q α
    subject to  0 ≤ α_i ≤ 1/(νl), i = 1, . . . , l,                     (2.4)
                e^T α = 1,
where Qij = K(xi,xj) ≡ φ(xi)Tφ(xj).
In LIBSVM we solve a scaled version of (2.4):
    min_α  (1/2) α^T Q α
    subject to  0 ≤ α_i ≤ 1, i = 1, . . . , l,
                e^T α = νl.
The decision function is
    sgn( Σ_{i=1}^{l} α_i K(x_i, x) − ρ ).
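Unlike the two-class case, the one-class decision function involves no labels: a point is accepted when the kernel expansion exceeds the threshold ρ. A minimal sketch with a linear kernel and toy values (the names are ours, not LIBSVM's API):

```python
def linear_kernel(x, z):
    # K(x, z) = x^T z
    return sum(a * b for a, b in zip(x, z))

def one_class_decide(x, support_vectors, alpha, rho, kernel=linear_kernel):
    # sgn( sum_i alpha_i * K(x_i, x) - rho ): +1 inside the estimated support
    s = sum(ai * kernel(xi, x) for xi, ai in zip(support_vectors, alpha)) - rho
    return 1 if s >= 0 else -1
```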
2.4 ε-Support Vector Regression (ε-SVR)
Given a set of data points, {(x1, z1), . . . , (xl, zl)}, such that xi ∈ Rn is an input and
zi ∈ R1 is a target output, the standard form of support vector regression (Vapnik,
1998) is:
    min_{w,b,ξ,ξ*}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i + C Σ_{i=1}^{l} ξ*_i
    subject to      w^T φ(x_i) + b − z_i ≤ ε + ξ_i,
                    z_i − w^T φ(x_i) − b ≤ ε + ξ*_i,
                    ξ_i, ξ*_i ≥ 0, i = 1, . . . , l.
The dual is:
    min_{α,α*}  (1/2) (α − α*)^T Q (α − α*) + ε Σ_{i=1}^{l} (α_i + α*_i) + Σ_{i=1}^{l} z_i (α_i − α*_i)
    subject to  Σ_{i=1}^{l} (α_i − α*_i) = 0, 0 ≤ α_i, α*_i ≤ C, i = 1, . . . , l,   (2.5)
where Qij = K(xi,xj) ≡ φ(xi)Tφ(xj).
The approximate function is:
    Σ_{i=1}^{l} (−α_i + α*_i) K(x_i, x) + b.
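The regression prediction is the same kernel expansion without the sign operation. A sketch with a linear kernel (function names and toy values are illustrative only):

```python
def linear_kernel(x, z):
    # K(x, z) = x^T z
    return sum(a * b for a, b in zip(x, z))

def svr_predict(x, support_vectors, alpha, alpha_star, b, kernel=linear_kernel):
    # sum_i (-alpha_i + alpha*_i) * K(x_i, x) + b
    return sum((-a + a_s) * kernel(xi, x)
               for xi, a, a_s in zip(support_vectors, alpha, alpha_star)) + b
```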
2.5 ν-Support Vector Regression (ν-SVR)
Similar to ν-SVC, for regression Schölkopf et al. (2000) use a parameter ν to control
the number of support vectors. However, unlike ν-SVC, where ν replaces C,
here ν replaces the parameter ε of ε-SVR. The primal form is
    min_{w,b,ξ,ξ*,ε}  (1/2) w^T w + C ( νε + (1/l) Σ_{i=1}^{l} (ξ_i + ξ*_i) )     (2.6)
    subject to        (w^T φ(x_i) + b) − z_i ≤ ε + ξ_i,
                      z_i − (w^T φ(x_i) + b) ≤ ε + ξ*_i,
                      ξ_i, ξ*_i ≥ 0, i = 1, . . . , l, ε ≥ 0,
and the dual is
    min_{α,α*}  (1/2) (α − α*)^T Q (α − α*) + z^T (α − α*)
    subject to  e^T (α − α*) = 0, e^T (α + α*) ≤ Cν,
                0 ≤ α_i, α*_i ≤ C/l, i = 1, . . . , l.                  (2.7)
Similarly, the inequality eT (α + α∗) ≤ Cν can be replaced by an equality. In
LIBSVM, we consider C ← C/l, so the dual problem solved is:
    min_{α,α*}  (1/2) (α − α*)^T Q (α − α*) + z^T (α − α*)
    subject to  e^T (α − α*) = 0, e^T (α + α*) = Clν,
                0 ≤ α_i, α*_i ≤ C, i = 1, . . . , l.                    (2.8)
The decision function is
    Σ_{i=1}^{l} (−α_i + α*_i) K(x_i, x) + b,
which is the same as that of ε-SVR.
3 Solving the Quadratic Problems
3.1 The Decomposition Method for C-SVC, ε-SVR, and One-class SVM
We consider the following general form of C-SVC, ε-SVR, and one-class SVM:
    min_α  (1/2) α^T Q α + p^T α
    subject to  y^T α = ∆,                                              (3.1)
                0 ≤ α_t ≤ C, t = 1, . . . , l,
where y_t = ±1, t = 1, . . . , l. Clearly, C-SVC and one-class SVM
are already in the form of (3.1). For ε-SVR, we consider the following reformulation
of (2.5):
    min_{α,α*}  (1/2) [α^T, (α*)^T] [ Q  −Q ; −Q  Q ] [α ; α*] + [ε e^T + z^T, ε e^T − z^T] [α ; α*]
    subject to  y^T [α ; α*] = 0, 0 ≤ α_t, α*_t ≤ C, t = 1, . . . , l,  (3.2)
where y is a 2l by 1 vector with y_t = 1, t = 1, . . . , l and y_t = −1, t = l + 1, . . . , 2l.
The difficulty of solving (3.1) comes from the density of Q, since Q_ij is in general not zero.
In LIBSVM, we consider the decomposition method to conquer this difficulty. Some
works on this method are, for example, (Osuna et al., 1997a; Joachims, 1998; Platt,
1998). This method modifies only a subset of α per iteration. This subset, denoted
as the working set B, leads to a small sub-problem to be minimized in each iteration.
An extreme case is the Sequential Minimal Optimization (SMO) (Platt, 1998), which
restricts B to have only two elements. Then in each iteration one solves a simple
two-variable problem without needing optimization software. Here we consider an
SMO-type decomposition method proposed in Fan et al. (2005).
Algorithm 1 (An SMO-type Decomposition method in Fan et al. (2005))
1. Find α1 as the initial feasible solution. Set k = 1.
2. If α^k is a stationary point of (2.2), stop. Otherwise, find a two-element working
   set B = {i, j} by WSS 1 (described in Section 3.2). Define N ≡ {1, . . . , l}\B
   and let α^k_B and α^k_N be sub-vectors of α^k corresponding to B and N, respectively.
3. If a_ij ≡ K_ii + K_jj − 2K_ij > 0
   Solve the following sub-problem with the variable α_B:
       min_{α_i,α_j}  (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [α_i ; α_j] + (p_B + Q_BN α^k_N)^T [α_i ; α_j]
       subject to  0 ≤ α_i, α_j ≤ C,                                    (3.3)
                   y_i α_i + y_j α_j = ∆ − y^T_N α^k_N,
   else
   Solve
       min_{α_i,α_j}  (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [α_i ; α_j] + (p_B + Q_BN α^k_N)^T [α_i ; α_j]
                      + ((τ − a_ij)/4) ((α_i − α^k_i)^2 + (α_j − α^k_j)^2)   (3.4)
       subject to constraints of (3.3).
4. Set α^{k+1}_B to be the optimal solution of (3.3) and α^{k+1}_N ≡ α^k_N. Set k ← k + 1
   and go to Step 2.
Note that B is updated at each iteration. To simplify the notation, we simply
use B instead of Bk. If aij ≤ 0, (3.3) is a concave problem. Hence we use a convex
modification in (3.4).
3.2 Stopping Criteria and Working Set Selection for C-SVC, ε-SVR, and One-class SVM
The Karush-Kuhn-Tucker (KKT) optimality condition of (3.1) shows that a vector α
is a stationary point of (3.1) if and only if there is a number b and two nonnegative
vectors λ and µ such that
∇f(α) + by = λ− µ,
λiαi = 0, µi(C − αi) = 0, λi ≥ 0, µi ≥ 0, i = 1, . . . , l,
where ∇f(α) ≡ Qα+ p is the gradient of f(α). This condition can be rewritten as
    ∇f(α)_i + b y_i ≥ 0 if α_i < C,                                     (3.5)
    ∇f(α)_i + b y_i ≤ 0 if α_i > 0.                                     (3.6)
Since y_i = ±1, by defining
    I_up(α) ≡ {t | α_t < C, y_t = 1 or α_t > 0, y_t = −1}, and
    I_low(α) ≡ {t | α_t < C, y_t = −1 or α_t > 0, y_t = 1},             (3.7)
a feasible α is a stationary point of (3.1) if and only if
    m(α) ≤ M(α),                                                        (3.8)
where
    m(α) ≡ max_{i∈I_up(α)} −y_i ∇f(α)_i, and M(α) ≡ min_{i∈I_low(α)} −y_i ∇f(α)_i.
From this we have the following stopping condition:
    m(α^k) − M(α^k) ≤ ε.                                                (3.9)
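The index sets (3.7), the quantities m(α), M(α), and the stopping check (3.9) take only a few lines to compute. This is a pure-Python sketch with illustrative names (`m_and_M`, `should_stop`), not LIBSVM's implementation:

```python
def m_and_M(alpha, y, grad, C):
    # I_up and I_low from (3.7); m and M from (3.8)
    up = [-y[t] * grad[t] for t in range(len(alpha))
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [-y[t] * grad[t] for t in range(len(alpha))
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    return max(up), min(low)

def should_stop(alpha, y, grad, C, eps=1e-3):
    # stopping condition (3.9): m(alpha) - M(alpha) <= eps
    m, M = m_and_M(alpha, y, grad, C)
    return m - M <= eps
```

At the C-SVC starting point α = 0 the gradient is p = −e, so m = 1 and M = −1: the condition is far from satisfied and iterations proceed.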
For the selection of the working set B, we consider the following procedure:
WSS 1
1. For all t, s, define
       a_ts ≡ K_tt + K_ss − 2K_ts,  b_ts ≡ −y_t ∇f(α^k)_t + y_s ∇f(α^k)_s > 0   (3.10)
   and
       ā_ts ≡ a_ts if a_ts > 0, τ otherwise.                            (3.11)
   Select
       i ∈ arg max_t { −y_t ∇f(α^k)_t | t ∈ I_up(α^k) },
       j ∈ arg min_t { −b^2_it / ā_it | t ∈ I_low(α^k), −y_t ∇f(α^k)_t < −y_i ∇f(α^k)_i }.   (3.12)
2. Return B = {i, j}.
Details of how we choose this working set are in (Fan et al., 2005, Section II).
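WSS 1 can be sketched as follows: pick i as the maximal violator in I_up, then pick j in I_low by the second-order gain −b²_it/ā_it of (3.12). A pure-Python sketch with a dense kernel matrix K (names and data layout are ours, not LIBSVM's):

```python
def select_working_set(alpha, y, grad, K, C, tau=1e-12):
    l = len(alpha)
    in_up = lambda t: (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)
    in_low = lambda t: (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)
    # i: maximal violating index in I_up
    i = max((t for t in range(l) if in_up(t)), key=lambda t: -y[t] * grad[t])
    gi = -y[i] * grad[i]
    best_j, best_obj = None, 0.0
    for t in range(l):
        gt = -y[t] * grad[t]
        if in_low(t) and gt < gi:
            a = K[i][i] + K[t][t] - 2 * K[i][t]  # a_it of (3.10)
            a_bar = a if a > 0 else tau          # (3.11)
            b = gi - gt                          # b_it > 0
            obj = -(b * b) / a_bar               # objective of (3.12)
            if obj < best_obj:
                best_j, best_obj = t, obj
    return i, best_j
```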
3.3 Convergence of the Decomposition Method
See (Fan et al., 2005, Section III) or (Chen et al., 2006) for a detailed discussion of
the convergence of Algorithm 1.
3.4 The Decomposition Method for ν-SVC and ν-SVR
Both ν-SVC and ν-SVR can be considered as the following general form:
    min_α  (1/2) α^T Q α + p^T α
    subject to  y^T α = ∆_1,                                            (3.13)
                e^T α = ∆_2,
                0 ≤ α_t ≤ C, t = 1, . . . , l.
The KKT condition of (3.13) shows
    ∇f(α)_i − ρ + b y_i  = 0 if 0 < α_i < C,
                         ≥ 0 if α_i = 0,
                         ≤ 0 if α_i = C.
Define
r1 ≡ ρ− b, r2 ≡ ρ+ b.
If y_i = 1, the KKT condition becomes
    ∇f(α)_i − r_1 ≥ 0 if α_i < C,                                       (3.14)
                  ≤ 0 if α_i > 0.
On the other hand, if y_i = −1, it is
    ∇f(α)_i − r_2 ≥ 0 if α_i < C,                                       (3.15)
                  ≤ 0 if α_i > 0.
Hence given a tolerance ε > 0, the stopping condition is:
    max( m_p(α) − M_p(α), m_n(α) − M_n(α) ) < ε,                        (3.16)
where
    m_p(α) ≡ max_{i∈I_up(α), y_i=1} −y_i ∇f(α)_i,  M_p(α) ≡ min_{i∈I_low(α), y_i=1} −y_i ∇f(α)_i, and
    m_n(α) ≡ max_{i∈I_up(α), y_i=−1} −y_i ∇f(α)_i,  M_n(α) ≡ min_{i∈I_low(α), y_i=−1} −y_i ∇f(α)_i.
The working set selection is obtained by extending WSS 1 to the following
WSS 2 (Extending WSS 1 to ν-SVM)
1. Find
       i_p ∈ arg m_p(α^k),
       j_p ∈ arg min_t { −b^2_{i_p t} / ā_{i_p t} | y_t = 1, t ∈ I_low(α^k), −y_t ∇f(α^k)_t < −y_{i_p} ∇f(α^k)_{i_p} }.
2. Find
       i_n ∈ arg m_n(α^k),
       j_n ∈ arg min_t { −b^2_{i_n t} / ā_{i_n t} | y_t = −1, t ∈ I_low(α^k), −y_t ∇f(α^k)_t < −y_{i_n} ∇f(α^k)_{i_n} }.
3. Return {i_p, j_p} or {i_n, j_n} depending on which one gives smaller −b^2_ij / ā_ij.
3.5 Analytical Solutions
Details are described in Section 5 in which we discuss the solution of a more general
sub-problem.
3.6 The Calculation of b or ρ
After the solution α of the dual optimization problem is obtained, the variables b or ρ
must be calculated as they are used in the decision function. Here we simply describe
the case of ν-SVC and ν-SVR where b and ρ both appear. Other formulations are
simplified cases of them.
The KKT condition of (3.13) has been shown in (3.14) and (3.15). Now we consider
the case of yi = 1. If there are αi which satisfy 0 < αi < C, then r1 = ∇f(α)i.
Practically to avoid numerical errors, we average them:
    r_1 = ( Σ_{0<α_i<C, y_i=1} ∇f(α)_i ) / |{i | 0 < α_i < C, y_i = 1}|.
If there are no such α_i, r_1 is taken as the midpoint of the range implied by (3.14). The value r_2 for the case y_i = −1 is obtained in the same way from (3.15). Then, from r_1 ≡ ρ − b and r_2 ≡ ρ + b,
    ρ = (r_1 + r_2)/2 and b = (r_2 − r_1)/2.

4 Shrinking and Caching

4.1 Shrinking
Since for many problems the number of free support vectors (i.e., those with 0 < α_i < C) is small, the shrinking technique reduces the size of the working problem by temporarily removing some bounded variables (Joachims, 1998). The following theorem supports this: let ᾱ be the optimal solution and define
    I ≡ {i | −y_i ∇f(ᾱ)_i > M(ᾱ) or −y_i ∇f(ᾱ)_i < m(ᾱ)}.             (4.1)

Theorem 4.1
1. Problem (2.2) has unique and bounded optimal solutions at α_i, i ∈ I.
2. Assume Algorithm 1 generates an infinite sequence {α^k}. There is k̄ such that
after k ≥ k̄, every α^k_i, i ∈ I has reached the unique and bounded optimal solution.
It remains the same in all subsequent iterations and ∀k ≥ k̄:
    i ∉ {t | M(α^k) ≤ −y_t ∇f(α^k)_t ≤ m(α^k)}.                        (4.2)
Hence instead of solving the whole problem (2.2), the decomposition method works
on a smaller problem:
    min_{α_A}  (1/2) α^T_A Q_AA α_A + (p_A + Q_AN α^k_N)^T α_A
    subject to  0 ≤ (α_A)_t ≤ C, t = 1, . . . , q,                      (4.3)
                y^T_A α_A = ∆ − y^T_N α^k_N,
where N = {1, . . . , l}\A is the set of shrunken variables.
Of course this heuristic may fail if the optimal solution of (4.3) is not the corresponding
part of that of (2.2). When that happens, the whole problem (2.2) is
reoptimized starting from a point α where α_A is an optimal solution of (4.3) and α_N
consists of the bounded variables identified before the shrinking process. Note that while solving
the shrunken problem (4.3), we only know the gradient Q_AA α_A + Q_AN α_N + p_A of
(4.3). Hence when problem (2.2) is reoptimized, we also have to reconstruct the whole
gradient ∇f(α), which is quite expensive.
Many implementations begin the shrinking procedure near the end of the iterative
process; in LIBSVM, however, we start the shrinking process from the beginning. The
procedure is as follows:
1. After every min(l, 1000) iterations, we try to shrink some variables. Note that
   during the iterative process
       m(α^k) > M(α^k)                                                  (4.4)
   as (3.9) is not satisfied yet. Following Theorem 4.1, we conjecture that variables
   in the following set can be shrunken:
       {t | −y_t ∇f(α^k)_t > m(α^k), t ∈ I_low(α^k), α^k_t is bounded} ∪
       {t | −y_t ∇f(α^k)_t < M(α^k), t ∈ I_up(α^k), α^k_t is bounded}
     = {t | −y_t ∇f(α^k)_t > m(α^k), α^k_t = C, y_t = 1 or α^k_t = 0, y_t = −1} ∪
       {t | −y_t ∇f(α^k)_t < M(α^k), α^k_t = 0, y_t = 1 or α^k_t = C, y_t = −1}.   (4.5)
   Thus the set A of activated variables is dynamically reduced every min(l, 1000)
   iterations.
2. Of course the above shrinking strategy may be too aggressive. Since the decomposition
   method converges very slowly and a large portion of the iterations
   is spent on achieving the final digit of the required accuracy, we do not
   want those iterations wasted because of a wrongly shrunken problem (4.3).
   Hence when the decomposition method first achieves the tolerance
       m(α^k) ≤ M(α^k) + 10ε,
   where ε is the specified stopping tolerance, we reconstruct the whole gradient.
   Then we inactivate some variables based on the current set (4.5), and the decomposition
   method continues.
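The candidate set (4.5) can be sketched as follows. The function name and list-based representation are illustrative only; given the current m(α^k) and M(α^k), it returns the indices conjectured to stay at a bound:

```python
def shrink_candidates(alpha, y, grad, C, m_k, M_k):
    # variables in set (4.5): bounded and violating strictly beyond m or M
    shrunk = []
    for t in range(len(alpha)):
        g = -y[t] * grad[t]
        bounded_in_Ilow = (alpha[t] == C and y[t] == 1) or (alpha[t] == 0 and y[t] == -1)
        bounded_in_Iup = (alpha[t] == 0 and y[t] == 1) or (alpha[t] == C and y[t] == -1)
        if (g > m_k and bounded_in_Ilow) or (g < M_k and bounded_in_Iup):
            shrunk.append(t)
    return shrunk
```

In the toy call below, only the variable at its upper bound with −y_t∇f_t above m(α^k) is shrunk; the free variable is never removed.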
Therefore, in LIBSVM, the size of the set A of (4.3) is dynamically reduced. To
decrease the cost of reconstructing the gradient ∇f(α), during the iterations we
always keep
    Ḡ_i = C Σ_{α_j=C} Q_ij, i = 1, . . . , l.
Then for the gradient ∇f(α)_i, i ∉ A, we have
    ∇f(α)_i = Σ_{j=1}^{l} Q_ij α_j + p_i = Ḡ_i + Σ_{0<α_j<C} Q_ij α_j + p_i,
so the whole gradient can be reconstructed from Ḡ and the free variables only.
For ν-SVC and ν-SVR, whose stopping condition (3.16) treats the two classes separately, the shrinking is also conducted separately. For y_t = 1, we shrink the following set:
    {t | −y_t ∇f(α^k)_t > m_p(α^k), α_t = C, y_t = 1} ∪
    {t | −y_t ∇f(α^k)_t < M_p(α^k), α_t = 0, y_t = 1}.
For y_t = −1, we shrink the following set:
    {t | −y_t ∇f(α^k)_t > m_n(α^k), α_t = 0, y_t = −1} ∪
    {t | −y_t ∇f(α^k)_t < M_n(α^k), α_t = C, y_t = −1}.
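The role of Ḡ is easy to see in code: maintaining Ḡ_i = C Σ_{α_j=C} Q_ij lets one gradient entry be rebuilt by summing over free variables only. A sketch with illustrative names (not LIBSVM's data structures):

```python
def g_bar(C, Q_row, alpha):
    # G_bar_i = C * sum_{alpha_j = C} Q_ij, maintained during the iterations
    return C * sum(q for q, a in zip(Q_row, alpha) if a == C)

def reconstruct_gradient_entry(Q_row, alpha, p_i, C, G_bar_i):
    # grad_i = G_bar_i + sum_{0 < alpha_j < C} Q_ij * alpha_j + p_i
    return G_bar_i + sum(q * a for q, a in zip(Q_row, alpha) if 0 < a < C) + p_i
```

For Q_row = [2, 3, 4], alpha = [1, 0.5, 0], C = 1, p_i = −1, the reconstructed value equals the full sum 2·1 + 3·0.5 + 4·0 − 1.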
4.2 Caching
Another technique for reducing the computational time is caching. Since Q is fully
dense and may not be stored in the computer memory, elements Qij are calculated
as needed. Then usually a special storage using the idea of a cache is used to store
recently used Qij (Joachims, 1998). Hence the computational cost of later iterations
can be reduced.
Theorem 4.1 also supports the use of the cache: in final iterations, only some
columns of the matrix Q are still needed. Thus if the cache can contain these columns,
we can avoid most kernel evaluations in final iterations.
In LIBSVM, we implement a simple least-recently-used strategy for the cache. We
dynamically cache only recently used columns of Q_AA of (4.3).
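The least-recently-used idea can be sketched in a few lines; this is our illustration of the concept, not LIBSVM's actual C++ Cache class:

```python
from collections import OrderedDict

class ColumnCache:
    # LRU cache for kernel-matrix columns: computed columns are kept until
    # capacity is reached, then the least-recently-used column is evicted
    def __init__(self, max_columns, compute_column):
        self.max_columns = max_columns
        self.compute_column = compute_column  # function: index -> column of Q
        self.cols = OrderedDict()

    def get(self, j):
        if j in self.cols:
            self.cols.move_to_end(j)          # mark as most recently used
            return self.cols[j]
        if len(self.cols) >= self.max_columns:
            self.cols.popitem(last=False)     # evict least-recently-used
        self.cols[j] = self.compute_column(j)
        return self.cols[j]
```

With capacity 2, requesting columns 0, 1, 0, 2, 1 computes each of 0, 1, 2 once and then recomputes 1, since 1 was the least recently used when 2 arrived.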
4.3 Computational Complexity
The discussion in Section 3.3 is about the asymptotic convergence of the decomposi-
tion method. Here, we discuss the computational complexity.
The main operations are finding Q_BN α^k_N + p_B of (3.3) and updating
∇f(α^k) to ∇f(α^{k+1}). Note that ∇f(α) is used in the working set selection as well
as the stopping condition. They can be considered together as
    Q_BN α^k_N + p_B = ∇f(α^k) − Q_BB α^k_B,                            (4.6)
and
    ∇f(α^{k+1}) = ∇f(α^k) + Q_{:,B} (α^{k+1}_B − α^k_B),                (4.7)
where Q_{:,B} is the sub-matrix of Q with columns in B. That is, at the kth iteration,
as we already have ∇f(α^k), the right-hand side of (4.6) is used to construct the
sub-problem. After the sub-problem is solved, (4.7) is employed to obtain the next
∇f(α^{k+1}). As B has only two elements and solving the sub-problem is easy, the
main cost is Q_{:,B} (α^{k+1}_B − α^k_B) of (4.7). The operation itself takes O(2l), but if Q_{:,B} is
not available in the cache and each kernel evaluation costs O(n), calculating one column
of Q_{:,B} already needs O(ln). Therefore, the complexity is:
1. #Iterations × O(l) if most columns of Q are cached during iterations.
2. #Iterations × O(nl) if most columns of Q are not cached during iterations and each
kernel evaluation is O(n).
Note that if shrinking is incorporated, l will gradually decrease during iterations.
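The O(l) update (4.7) can be sketched directly: given the two columns of Q indexed by B = {i, j} and the changes in α_i, α_j, every gradient entry receives a rank-two correction. Names are illustrative only:

```python
def update_gradient(grad, Q_col_i, Q_col_j, d_alpha_i, d_alpha_j):
    # grad^{k+1} = grad^k + Q[:, B] (alpha_B^{k+1} - alpha_B^k), B = {i, j}
    return [g + qi * d_alpha_i + qj * d_alpha_j
            for g, qi, qj in zip(grad, Q_col_i, Q_col_j)]
```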
5 Unbalanced Data
For some classification problems, the numbers of data in different classes are unbalanced.
Hence some researchers (e.g., Osuna et al., 1997b) have proposed using different
penalty parameters in the SVM formulation: for example, C-SVM becomes
    min_{w,b,ξ}  (1/2) w^T w + C^+ Σ_{y_i=1} ξ_i + C^− Σ_{y_i=−1} ξ_i
    subject to   y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, . . . , l.
Its dual is
    min_α  (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C^+, if y_i = 1,                              (5.1)
                0 ≤ α_i ≤ C^−, if y_i = −1,                             (5.2)
                y^T α = 0.
Note that by replacing C with different C_i, i = 1, . . . , l, most of the earlier analysis
is still correct. Using C^+ and C^− is just a special case of this, so the
implementation is almost the same. A main difference is in the solution of the sub-problem
(3.3), which now becomes:
    min_{α_i,α_j}  (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ji  Q_jj ] [α_i ; α_j] + (Q_{i,N} α_N − 1) α_i + (Q_{j,N} α_N − 1) α_j
    subject to  y_i α_i + y_j α_j = ∆ − y^T_N α^k_N,                    (5.3)
                0 ≤ α_i ≤ C_i, 0 ≤ α_j ≤ C_j,
where C_i and C_j can be C^+ or C^− depending on y_i and y_j.
Let α_i = α^k_i + d_i, α_j = α^k_j + d_j and d̂_i ≡ y_i d_i, d̂_j ≡ y_j d_j. Then (5.3) can be written
as
    min_{d_i,d_j}  (1/2) [d_i, d_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [d_i ; d_j] + [∇f(α^k)_i, ∇f(α^k)_j] [d_i ; d_j]
    subject to  y_i d_i + y_j d_j = 0,                                  (5.4)
                −α^k_i ≤ d_i ≤ C_i − α^k_i, −α^k_j ≤ d_j ≤ C_j − α^k_j.
Define a_ij and b_ij as in (3.10). Note that if a_ij ≤ 0, a modification similar to
(3.4) is applied. Using d̂_i = −d̂_j, the objective function can be written as
    (1/2) ā_ij d̂_j^2 + b_ij d̂_j.
Thus,
    α^new_i = α^k_i + y_i b_ij / ā_ij,
    α^new_j = α^k_j − y_j b_ij / ā_ij.                                  (5.5)
To modify them back to the feasible region, we first consider the case y_i ≠ y_j and
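The unconstrained step (5.5) followed by projection back to the box can be sketched as follows. The paper's case analysis for the clipping is truncated in this copy, so the sketch below uses a standard projection that keeps the equality constraint y_i α_i + y_j α_j fixed; function and parameter names are ours:

```python
def two_variable_update(alpha_i, alpha_j, y_i, y_j, grad_i, grad_j,
                        K_ii, K_jj, K_ij, C_i, C_j, tau=1e-12):
    # unconstrained minimizer along y_i d_i + y_j d_j = 0, i.e. (5.5)
    a = K_ii + K_jj - 2 * K_ij
    a_bar = a if a > 0 else tau                # convex modification as in (3.4)
    b = -y_i * grad_i + y_j * grad_j           # b_ij of (3.10)
    new_i = alpha_i + y_i * b / a_bar
    new_j = alpha_j - y_j * b / a_bar
    # clip back to [0, C_i] x [0, C_j] while preserving y_i a_i + y_j a_j
    # (a standard projection sketch, not the paper's full case analysis)
    s = y_i * new_i + y_j * new_j              # conserved linear combination
    new_i = min(max(new_i, 0.0), C_i)
    new_j = y_j * (s - y_i * new_i)
    new_j = min(max(new_j, 0.0), C_j)
    new_i = y_i * (s - y_j * new_j)
    new_i = min(max(new_i, 0.0), C_i)
    return new_i, new_j
```

For the toy C-SVC case below (α = 0, p = −e, identity kernel, opposite labels), the step drives both variables to their common unconstrained optimum inside the box.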