Support Vector Machines and Kernel-Based
Algorithms for Machine Learning
An Introduction
Mehmet Gönen
Department of Computer Engineering
Boğaziçi University
18.05.2007

Part I
Support Vector Machine (SVM) Theory

1. Binary Classification Problem
2. Hard Margin SVM
3. Soft Margin SVM
4. Regression Problem
5. Regression SVM
6. Kernel Functions
7. Comments

Binary Classification Problem Definition

Given an empirical dataset $(X, Y)$:
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$$

Separate the two classes linearly:
$$\langle w, x_i \rangle + b \geq +1 \quad \text{if } y_i = +1$$
$$\langle w, x_i \rangle + b \leq -1 \quad \text{if } y_i = -1$$

More succinctly, find a hyperplane such that
$$y_i (\langle w, x_i \rangle + b) \geq +1$$

The decision function becomes
$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b)$$
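
As a quick illustration, here is a minimal numpy sketch of this decision function (the weight vector $w$ and offset $b$ below are hand-picked placeholders, not trained values):

```python
import numpy as np

def decide(w, b, x):
    """Linear SVM decision function: f(x) = sgn(<w, x> + b)."""
    return int(np.sign(np.dot(w, x) + b))

# Hand-picked parameters, purely illustrative.
w, b = np.array([1.0, -1.0]), 0.5
print(decide(w, b, np.array([2.0, 0.0])))   # -> 1
print(decide(w, b, np.array([-2.0, 0.0])))  # -> -1
```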

Geometric Motivation for Hard Margin SVM

[Figure: two classes separated by the hyperplanes $\langle w, x \rangle + b = +1$, $\langle w, x \rangle + b = 0$, and $\langle w, x \rangle + b = -1$]

Distance to the discriminant:
$$|\langle w, x_i \rangle + b| / \|w\|$$

We require
$$y_i (\langle w, x_i \rangle + b) / \|w\| \geq \rho$$

To obtain a unique solution:
$$\rho \|w\| = 1$$

Hard Margin SVM

Optimization Problem
$$\begin{aligned}
\underset{w, b}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 \\
\text{subject to} \quad & y_i (\langle w, x_i \rangle + b) \geq 1 \qquad i = 1, \ldots, n
\end{aligned}$$

Lagrangian Dual
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right)$$

$$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

Hard Margin SVM

Dual Optimization Problem
$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\
\text{subject to} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0 \\
& \alpha_i \geq 0 \qquad i = 1, \ldots, n
\end{aligned}$$

Decision Function
$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \Big)$$

Only the positive $\alpha_i$'s contribute; the corresponding instances are called support vectors.

Hard Margin SVM

Karush-Kuhn-Tucker Theorem
$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0 \qquad i = 1, \ldots, n$$
$$\alpha_i > 0 \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) - 1 = 0$$
$$\alpha_i = 0 \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) - 1 > 0$$

$x_i$'s with $\alpha_i > 0$ lie on the separating hyperplanes (support vectors)
$b$ can be calculated from any one of these instances
(numerically safer to average over all $x_i$'s with $\alpha_i > 0$)
$x_i$'s with $\alpha_i = 0$ lie beyond the separating hyperplanes
No need to store them
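
A minimal numpy sketch of this recovery step, assuming the dual solution `alpha` (with training data `X`, `y`) is already at hand from a QP solver: it forms $w$ via the stationarity condition and, as recommended above, averages $b$ over all support vectors.

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """Recover (w, b) from a hard-margin dual solution alpha.

    w = sum_i alpha_i y_i x_i          (stationarity condition)
    For any support vector, y_i (<w, x_i> + b) = 1; since y_i is
    +1 or -1, this gives b = y_i - <w, x_i>.  Averaging over all
    support vectors is numerically safer than using just one.
    """
    w = (alpha * y) @ X            # alpha, y: shape (n,); X: shape (n, d)
    sv = alpha > tol               # support vector mask
    b = float(np.mean(y[sv] - X[sv] @ w))
    return w, b
```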

Geometric Motivation for Soft Margin SVM

[Figure: margin hyperplanes $\langle w, x \rangle + b = +1$, $0$, $-1$, with slack instances marked $\xi > 0$ and $\xi > 1$]

Allow misclassification:
$$y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i$$

Minimize the misclassification count
$$\#(\xi_i \geq 1)$$

This is hard to solve; instead use the total soft error
$$\sum_{i=1}^{n} \xi_i$$

Soft Margin SVM

Optimization Problem
$$\begin{aligned}
\underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \qquad i = 1, \ldots, n \\
& \xi_i \geq 0 \qquad i = 1, \ldots, n
\end{aligned}$$

Lagrangian Dual
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 + \xi_i \right) - \sum_{i=1}^{n} \beta_i \xi_i$$

$$\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

$$\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial \xi_i} = 0 \;\Rightarrow\; C = \alpha_i + \beta_i$$

Soft Margin SVM

Dual Optimization Problem
$$\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\
\text{subject to} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0 \\
& C \geq \alpha_i \geq 0 \qquad i = 1, \ldots, n
\end{aligned}$$

Decision Function
$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \Big)$$

The decision function does not change; only the dual constraint gains the upper bound $C$.
Solve this QP with optimization software to find the $\alpha_i$'s:
ILOG CPLEX, MATLAB's quadprog function, ...
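
As a concrete alternative to the solvers above, here is a minimal sketch using the open-source cvxopt package (one choice among many; any QP solver works). The dual is flipped to a minimization and packed into cvxopt's standard form $\min \frac{1}{2}\alpha^\top P \alpha + q^\top \alpha$ subject to $G\alpha \leq h$, $A\alpha = b$:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(X, y, C):
    """Solve the soft-margin dual as a QP with cvxopt.

    minimize   (1/2) a^T P a - 1^T a,  with P_ij = y_i y_j <x_i, x_j>
    subject to 0 <= a_i <= C  and  sum_i a_i y_i = 0
    """
    n = X.shape[0]
    P = matrix(np.outer(y, y).astype(float) * (X @ X.T))
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))     # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))         # equality: y^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()                  # the alpha_i's
```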

Soft Margin SVM

Karush-Kuhn-Tucker Theorem
$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 + \xi_i \right] = 0 \qquad i = 1, \ldots, n$$
$$\beta_i \xi_i = 0$$
$$\alpha_i = C \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) - 1 + \xi_i = 0, \quad \xi_i > 0$$
$$C > \alpha_i > 0 \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) - 1 + \xi_i = 0, \quad \xi_i = 0$$
$$\alpha_i = 0 \;\Rightarrow\; y_i (\langle w, x_i \rangle + b) - 1 + \xi_i > 0$$

$x_i$'s with $\alpha_i = C$ make soft error ($\xi_i > 0$) (bound support vectors)
$x_i$'s with $C > \alpha_i > 0$ lie on the separating hyperplanes (in-bound support vectors)
$b$ can be calculated from any one of these instances
(numerically safer to average over all $x_i$'s with $C > \alpha_i > 0$)
No need to store $x_i$'s with $\alpha_i = 0$

Regression Problem Definition

Given an empirical dataset $(X, Y)$:
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$$

Use a linear model:
$$f(x) = \langle w, x \rangle + b$$

Use the $\varepsilon$-insensitive error function:
$$e_\varepsilon(y_i, f(x_i)) = \begin{cases} 0 & \text{if } |y_i - f(x_i)| \leq \varepsilon \\ |y_i - f(x_i)| - \varepsilon & \text{otherwise} \end{cases}$$

[Figure: the $\varepsilon$-tube around $f(x)$, bounded by $-\varepsilon$ and $+\varepsilon$]
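
A one-line numpy sketch of this error function (the default `eps` is an arbitrary illustrative value):

```python
import numpy as np

def eps_insensitive(y, f_x, eps=0.1):
    """Epsilon-insensitive error: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

print(eps_insensitive(1.0, 1.05))  # -> 0.0 (inside the tube)
print(eps_insensitive(1.0, 1.30))  # -> ~0.2 (|error| - eps)
```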

Regression SVM

Optimization Problem
$$\begin{aligned}
\underset{w, b, \xi^+, \xi^-}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) \\
\text{subject to} \quad & y_i - (\langle w, x_i \rangle + b) \leq \varepsilon + \xi_i^+ \qquad i = 1, \ldots, n \\
& (\langle w, x_i \rangle + b) - y_i \leq \varepsilon + \xi_i^- \qquad i = 1, \ldots, n \\
& \xi_i^+ \geq 0, \; \xi_i^- \geq 0 \qquad i = 1, \ldots, n
\end{aligned}$$

Lagrangian Dual
$$\begin{aligned}
L(w, b, \xi^\pm, \alpha^\pm, \beta^\pm) = {} & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) - \sum_{i=1}^{n} (\beta_i^+ \xi_i^+ + \beta_i^- \xi_i^-) \\
& - \sum_{i=1}^{n} \alpha_i^+ (\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b) \\
& - \sum_{i=1}^{n} \alpha_i^- (\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b)
\end{aligned}$$

Regression SVM

Dual Optimization Problem
$$\begin{aligned}
\underset{\alpha^+, \alpha^-}{\text{maximize}} \quad & \sum_{i=1}^{n} y_i (\alpha_i^+ - \alpha_i^-) - \varepsilon \sum_{i=1}^{n} (\alpha_i^+ + \alpha_i^-) \\
& - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-) \langle x_i, x_j \rangle \\
\text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i^+ - \alpha_i^-) = 0 \\
& C \geq \alpha_i^+ \geq 0 \qquad i = 1, \ldots, n \\
& C \geq \alpha_i^- \geq 0 \qquad i = 1, \ldots, n
\end{aligned}$$

Decision Function
$$f(x) = \langle w, x \rangle + b = \sum_{i=1}^{n} (\alpha_i^+ - \alpha_i^-) \langle x_i, x \rangle + b$$
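
A minimal numpy sketch of this prediction rule, assuming dual solutions `alpha_plus` and `alpha_minus` are already available from a QP solver:

```python
import numpy as np

def svr_predict(alpha_plus, alpha_minus, X_train, b, x):
    """SVR prediction: f(x) = sum_i (a+_i - a-_i) <x_i, x> + b."""
    coef = alpha_plus - alpha_minus        # shape (n,)
    return float(coef @ (X_train @ x) + b)
```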

Why Do We Need Kernels?

Define a transformation function $\phi$ from input space to feature space:
$$\phi: \mathcal{X} \mapsto \mathcal{H}$$

Map the data from input space to feature space:
$$x \mapsto \phi(x)$$
$$\langle x_i, x_j \rangle \mapsto \langle \phi(x_i), \phi(x_j) \rangle$$

Learn the discriminant in feature space.
There is no need to calculate $\phi(\cdot)$ explicitly; just replace $\langle \phi(x_i), \phi(x_j) \rangle$ with $K(x_i, x_j)$.
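
To make this concrete, here is a small numerical check for the homogeneous degree-2 polynomial kernel $K(x_i, x_j) = \langle x_i, x_j \rangle^2$, whose explicit feature map in two dimensions is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ — the kernel value matches the inner product in feature space with no explicit mapping required:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(u, v) = <u, v>^2 on 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(phi(xi), phi(xj)))  # -> 121.0 (inner product in feature space)
print(np.dot(xi, xj) ** 2)       # -> 121.0 (kernel value, no phi needed)
```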

How Do We Integrate Kernels Into Models?

Embed the kernel function into the dual optimization model and the decision function:
$$\underset{\alpha}{\text{maximize}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
$$\underset{\alpha}{\text{maximize}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle$$
$$\underset{\alpha}{\text{maximize}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

$$f(x) = \operatorname{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \Big)$$
$$f(x) = \operatorname{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \Big)$$
$$f(x) = \operatorname{sgn}\Big( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \Big)$$

Kernel Functions

Advantages
- Add non-linearity to linear models
- Work with non-vectorial data

Popular kernels (see the sketch below):
- Linear kernel: $K(x_i, x_j) = \langle x_i, x_j \rangle$
- Polynomial kernel: $K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d$
- Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$
- Sigmoid kernel: $K(x_i, x_j) = \tanh(2 \langle x_i, x_j \rangle + 1)$
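
A direct numpy transcription of the four kernels above (a minimal sketch; the default parameter values are arbitrary):

```python
import numpy as np

def linear(xi, xj):
    return np.dot(xi, xj)

def polynomial(xi, xj, d=2):
    return (np.dot(xi, xj) + 1) ** d

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

def sigmoid(xi, xj):
    return np.tanh(2 * np.dot(xi, xj) + 1)
```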

Comments on SVMs

Advantages
- Finds the global minimum (no local minima)
- Complexity depends on the support vector count, not on the dimensionality of the feature space
- Avoids over-fitting and works well with small datasets

Disadvantages
- Choice of the kernel and its parameters
- Multi-class classification is an open problem

For Further Reading

- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
- Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, 2002.
- Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
- Alexander J. Smola and Bernhard Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report NC2-TR-1998-030, 1998.
- Javier M. Moguerza and Alberto Muñoz. Support Vector Machines with Applications. Statistical Science, 21(3):322–336, 2006.

Part II
Multi-Class SVMs

8. Multi-Machine Approaches
   - One-Versus-All Approach (OVA)
   - All-Versus-All Approach (AVA)
9. Single Machine Approaches

One-Versus-All Approach (OVA)

[Figure: three classes $C_1$, $C_2$, $C_3$ and three binary hyperplanes $H_1$, $H_2$, $H_3$, each separating one class (+) from the other two (−)]

One-Versus-All Approach (OVA)

- Train $k$ distinct binary SVMs, each separating one class from the others
- That one class is labeled $+1$; all other classes are labeled $-1$
- A test instance is assigned to the class whose SVM produces the maximum output value (see the sketch below)
- $k$ optimization problems with $n$ decision variables each
- Comparison between the outputs of different SVMs may be problematic
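
A minimal sketch of the OVA decision rule, assuming the $k$ binary models have already been trained (here represented as hypothetical $(w, b)$ pairs):

```python
import numpy as np

def ova_predict(models, x):
    """One-versus-all: assign the class whose binary SVM scores highest.

    `models` is a hypothetical list of k (w, b) pairs, where model m
    was trained with class m labeled +1 and all other classes -1.
    """
    scores = [np.dot(w, x) + b for w, b in models]
    return int(np.argmax(scores))
```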

All-Versus-All Approach (AVA)

[Figure: three classes $C_1$, $C_2$, $C_3$ and pairwise hyperplanes $H_{12}$, $H_{13}$, $H_{23}$, each separating one class (+) from another (−)]

All-Versus-All Approach (AVA)

- Train $k(k-1)/2$ distinct binary SVMs, one for each possible pair of classes
- A voting scheme is required for testing (see the sketch below)
- $k(k-1)/2$ optimization problems with about $2n/k$ decision variables each (assuming equally sized classes)
- Possible variance increase due to small training set sizes
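
A minimal sketch of the AVA voting scheme, assuming the pairwise models are already trained (the dictionary layout is a hypothetical convention):

```python
import numpy as np

def ava_predict(pair_models, k, x):
    """All-versus-all: majority vote over the k(k-1)/2 pairwise SVMs.

    `pair_models` maps a class pair (a, c) with a < c to the (w, bias)
    of the binary SVM trained on classes a (+1) and c (-1) only.
    """
    votes = np.zeros(k, dtype=int)
    for (a, c), (w, bias) in pair_models.items():
        winner = a if np.dot(w, x) + bias >= 0 else c
        votes[winner] += 1
    return int(np.argmax(votes))
```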

Single Machine Multi-Class SVM

A more natural way is to consider all classes at once. The following SVM learns the $k$ discriminants together.

Optimization Problem
$$\begin{aligned}
\underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2} \sum_{m=1}^{k} \|w_m\|^2 + C \sum_{i=1}^{n} \sum_{m \neq y_i} \xi_i^m \\
\text{subject to} \quad & \langle w_{y_i}, x_i \rangle + b_{y_i} \geq \langle w_m, x_i \rangle + b_m + 2 - \xi_i^m \qquad i = 1, \ldots, n, \; m \neq y_i \\
& \xi_i^m \geq 0 \qquad i = 1, \ldots, n, \; m \neq y_i
\end{aligned}$$

Decision Function
$$f(x) = \arg\max_m \, (\langle w_m, x \rangle + b_m)$$
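
A minimal numpy sketch of this decision function, with the $k$ weight vectors stacked as rows of a matrix `W` (an assumed layout):

```python
import numpy as np

def predict_all_at_once(W, b, x):
    """Decision function f(x) = argmax_m (<w_m, x> + b_m).

    W: (k, d) array whose m-th row is w_m; b: (k,) array of offsets.
    """
    return int(np.argmax(W @ x + b))
```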

For Further Reading

- Vladimir N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
- Jason Weston and Chris Watkins. Multi-Class Support Vector Machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.
- Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
- Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141, 2004.
- Eddy Mayoraz and Ethem Alpaydın. Support Vector Machines for Multi-Class Classification. Engineering Applications of Bio-Inspired Artificial Neural Networks, 833–842, 1999.