Support Vector Machines and Kernel-Based Algorithms for Machine Learning
An Introduction

Mehmet Gönen
Department of Computer Engineering, Boğaziçi University
18.05.2007

Part I: Support Vector Machine (SVM) Theory

1 Binary Classification Problem
2 Hard Margin SVM
3 Soft Margin SVM
4 Regression Problem
5 Regression SVM
6 Kernel Functions
7 Comments

Binary Classification Problem

Definition: given an empirical dataset (X, Y),

  (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\},

separate the two classes linearly:

  \langle w, x_i \rangle + b \ge +1   if y_i = +1
  \langle w, x_i \rangle + b \le -1   if y_i = -1

More succinctly, find a hyperplane such that y_i (\langle w, x_i \rangle + b) \ge +1.
The decision function becomes f(x) = \mathrm{sgn}(\langle w, x \rangle + b).

Geometric Motivation for Hard Margin SVM

[Figure: separating hyperplane \langle w, x \rangle + b = 0 with margin hyperplanes \langle w, x \rangle + b = +1 and \langle w, x \rangle + b = -1]

The distance of x_i to the discriminant is |\langle w, x_i \rangle + b| / \|w\|.
We require y_i (\langle w, x_i \rangle + b) / \|w\| \ge \rho.
To obtain a unique solution, set \rho \|w\| = 1.

Hard Margin SVM

Optimization problem:

  minimize_{w, b}   (1/2) \|w\|^2
  subject to        y_i (\langle w, x_i \rangle + b) \ge 1,   i = 1, \dots, n

Lagrangian dual:

  L(w, b, \alpha) = (1/2) \|w\|^2 - \sum_{i=1}^{n} \alpha_i (y_i (\langle w, x_i \rangle + b) - 1)

  \partial L / \partial w = 0  \Rightarrow  w = \sum_{i=1}^{n} \alpha_i y_i x_i
  \partial L / \partial b = 0  \Rightarrow  \sum_{i=1}^{n} \alpha_i y_i = 0

Hard Margin SVM

Dual optimization problem:

  maximize_{\alpha}   \sum_{i=1}^{n} \alpha_i - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
  subject to          \sum_{i=1}^{n} \alpha_i y_i = 0
                      \alpha_i \ge 0,   i = 1, \dots, n

Decision function:

  f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right)

Only the positive \alpha_i's contribute; the corresponding instances are called support vectors.

Hard Margin SVM

Karush-Kuhn-Tucker theorem:

  \alpha_i [y_i (\langle w, x_i \rangle + b) - 1] = 0,   i = 1, \dots, n
  \alpha_i > 0  \Rightarrow  y_i (\langle w, x_i \rangle + b) - 1 = 0
  \alpha_i = 0  \Rightarrow  y_i (\langle w, x_i \rangle + b) - 1 > 0

The x_i's with \alpha_i > 0 lie on the margin hyperplanes (support vectors); b can be calculated from any one of these instances (it is numerically safer to average over all x_i's with \alpha_i > 0).
The x_i's with \alpha_i = 0 lie beyond the margin hyperplanes; there is no need to store them.

Geometric Motivation for Soft Margin SVM

[Figure: hyperplane \langle w, x \rangle + b = 0 with margins at \pm 1; slacks \xi > 0 mark margin violations and \xi > 1 mark misclassifications]

Allow misclassification: y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i.
Minimizing the misclassification count \#(\xi_i \ge 1) directly is hard to solve; instead use the total soft error \sum_{i=1}^{n} \xi_i.

Soft Margin SVM

Optimization problem:

  minimize_{w, b, \xi}   (1/2) \|w\|^2 + C \sum_{i=1}^{n} \xi_i
  subject to             y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i,   i = 1, \dots, n
                         \xi_i \ge 0,   i = 1, \dots, n

Lagrangian dual:

  L(w, b, \xi, \alpha, \beta) = (1/2) \|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i (y_i (\langle w, x_i \rangle + b) - 1 + \xi_i) - \sum_{i=1}^{n} \beta_i \xi_i

  \partial L / \partial w = 0      \Rightarrow  w = \sum_{i=1}^{n} \alpha_i y_i x_i
  \partial L / \partial b = 0      \Rightarrow  \sum_{i=1}^{n} \alpha_i y_i = 0
  \partial L / \partial \xi_i = 0  \Rightarrow  C = \alpha_i + \beta_i

Soft Margin SVM

Dual optimization problem:

  maximize_{\alpha}   \sum_{i=1}^{n} \alpha_i - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
  subject to          \sum_{i=1}^{n} \alpha_i y_i = 0
                      C \ge \alpha_i \ge 0,   i = 1, \dots, n

Decision function:

  f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right)

The decision function does not change. Solve this QP with optimization software to find the \alpha_i's (e.g., ILOG CPLEX or MATLAB's quadprog function).
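As a concrete illustration (an addition, not part of the original slides): the soft-margin dual above is a standard quadratic program, so any QP solver will do. The sketch below uses the open-source Python package cvxopt in place of CPLEX or quadprog; the function and variable names are illustrative.

    # Minimal sketch: the soft-margin SVM dual solved as a QP with cvxopt.
    # cvxopt solves  min (1/2) a'Pa + q'a  s.t.  Ga <= h, Aa = b,
    # so the dual maximization is negated into a minimization.
    import numpy as np
    from cvxopt import matrix, solvers

    def fit_soft_margin_svm(X, y, C=1.0):
        n = X.shape[0]
        K = X @ X.T                                     # Gram matrix <x_i, x_j>
        P = matrix((np.outer(y, y) * K).astype(float))  # P_ij = y_i y_j <x_i, x_j>
        q = matrix(-np.ones(n))                         # -sum_i alpha_i
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # encodes 0 <= alpha_i <= C
        h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
        A = matrix(y.reshape(1, -1).astype(float))      # sum_i alpha_i y_i = 0
        b = matrix(0.0)
        alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
        w = (alpha * y) @ X
        # average b over in-bound support vectors (0 < alpha_i < C), as advised above
        sv = (alpha > 1e-6) & (alpha < C - 1e-6)
        return w, float(np.mean(y[sv] - X[sv] @ w)), alpha

Setting C very large recovers (numerically) the hard-margin solution, since the upper bound on the \alpha_i's then never binds.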
Soft Margin SVM

Karush-Kuhn-Tucker theorem:

  \alpha_i [y_i (\langle w, x_i \rangle + b) - 1 + \xi_i] = 0,   i = 1, \dots, n
  \beta_i \xi_i = 0

  \alpha_i = C      \Rightarrow  y_i (\langle w, x_i \rangle + b) - 1 + \xi_i = 0,   \xi_i > 0
  C > \alpha_i > 0  \Rightarrow  y_i (\langle w, x_i \rangle + b) - 1 + \xi_i = 0,   \xi_i = 0
  \alpha_i = 0      \Rightarrow  y_i (\langle w, x_i \rangle + b) - 1 + \xi_i > 0,   \xi_i = 0

The x_i's with \alpha_i = C make soft errors (\xi_i > 0); they are called bound support vectors.
The x_i's with C > \alpha_i > 0 lie on the margin hyperplanes (in-bound support vectors); b can be calculated from any one of these instances (it is numerically safer to average over all x_i's with C > \alpha_i > 0).
There is no need to store the x_i's with \alpha_i = 0.

Regression Problem

Definition: given an empirical dataset (X, Y),

  (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R},

use a linear model f(x) = \langle w, x \rangle + b with the \varepsilon-insensitive error function

  e(y_i, f(x_i)) = 0 if |y_i - f(x_i)| \le \varepsilon,  and  |y_i - f(x_i)| - \varepsilon otherwise.

[Figure: \varepsilon-insensitive tube of half-width \varepsilon around the regression function]

Regression SVM

Optimization problem:

  minimize_{w, b, \xi^+, \xi^-}   (1/2) \|w\|^2 + C \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-)
  subject to   y_i - (\langle w, x_i \rangle + b) \le \varepsilon + \xi_i^+,   i = 1, \dots, n
               (\langle w, x_i \rangle + b) - y_i \le \varepsilon + \xi_i^-,   i = 1, \dots, n
               \xi_i^+ \ge 0,  \xi_i^- \ge 0,   i = 1, \dots, n

Lagrangian dual:

  L(w, b, \xi^+, \xi^-, \alpha^+, \alpha^-) = (1/2) \|w\|^2 + C \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) - \sum_{i=1}^{n} (\beta_i^+ \xi_i^+ + \beta_i^- \xi_i^-)
      - \sum_{i=1}^{n} \alpha_i^+ (\varepsilon + \xi_i^+ - y_i + \langle w, x_i \rangle + b) - \sum_{i=1}^{n} \alpha_i^- (\varepsilon + \xi_i^- + y_i - \langle w, x_i \rangle - b)

Regression SVM

Dual optimization problem:

  maximize_{\alpha^+, \alpha^-}   \sum_{i=1}^{n} y_i (\alpha_i^+ - \alpha_i^-) - \varepsilon \sum_{i=1}^{n} (\alpha_i^+ + \alpha_i^-) - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-) \langle x_i, x_j \rangle
  subject to   \sum_{i=1}^{n} (\alpha_i^+ - \alpha_i^-) = 0
               C \ge \alpha_i^+ \ge 0,   i = 1, \dots, n
               C \ge \alpha_i^- \ge 0,   i = 1, \dots, n

Decision function:

  f(x) = \langle w, x \rangle + b = \sum_{i=1}^{n} (\alpha_i^+ - \alpha_i^-) \langle x_i, x \rangle + b

Why Do We Need Kernels?

Define a transformation function \phi from the input space to a feature space, \phi : X \mapsto H, and map the data from input space to feature space:

  x \mapsto \phi(x)
  \langle x_i, x_j \rangle \mapsto \langle \phi(x_i), \phi(x_j) \rangle

Learn the discriminant in the feature space. There is no need to calculate \phi(\cdot) explicitly; just replace \langle \phi(x_i), \phi(x_j) \rangle with K(x_i, x_j).
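To make the last point concrete, here is a small self-contained check (an addition, not from the slides). For the degree-2 polynomial kernel K(x, z) = (\langle x, z \rangle + 1)^2 on \mathbb{R}^2, the explicit feature map \phi is known in closed form, and the kernel evaluated in input space equals the inner product in the 6-dimensional feature space:

    # Illustrative check: kernel value = inner product under the explicit map.
    import numpy as np

    def phi(x):
        # explicit feature map for K(x, z) = (<x, z> + 1)^2 with x in R^2
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    def poly_kernel(x, z):
        return (x @ z + 1.0) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(phi(x) @ phi(z))    # 4.0, computed in the feature space
    print(poly_kernel(x, z))  # 4.0, computed in the input space via the kernel

For the Gaussian kernel the feature space is infinite-dimensional, so evaluating K(x_i, x_j) directly is not just cheaper but the only option.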
How Do We Integrate Kernels Into Models?

Embed the kernel function into the dual optimization model and the decision function:

  maximize_{\alpha}   \sum_{i=1}^{n} \alpha_i - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
  maximize_{\alpha}   \sum_{i=1}^{n} \alpha_i - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle
  maximize_{\alpha}   \sum_{i=1}^{n} \alpha_i - (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

  f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right)
  f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \right)
  f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)

Kernel Functions

Advantages:
- Adds non-linearity to linear models
- Works with non-vectorial data

Popular kernels:
- Linear kernel:      K(x_i, x_j) = \langle x_i, x_j \rangle
- Polynomial kernel:  K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d
- Gaussian kernel:    K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)
- Sigmoid kernel:     K(x_i, x_j) = \tanh(2 \langle x_i, x_j \rangle + 1)

Comments on SVMs

Advantages:
+ Finds the global minimum (no local minima)
+ Complexity depends on the support vector count, not on the dimensionality of the feature space
+ Avoids over-fitting and works well with small datasets

Disadvantages:
- Choice of the kernel and its parameters
- Multi-class classification is an open problem

For Further Reading

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, 2002.
Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
Alexander J. Smola and Bernhard Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report NC2-TR-1998-030, 1998.
Javier M. Moguerza and Alberto Munoz. Support Vector Machines with Applications. Statistical Science, 21(3):322-336, 2006.

Part II: Multi-Class SVMs

8 Multi-Machine Approaches
  One-Versus-All Approach (OVA)
  All-Versus-All Approach (AVA)
9 Single Machine Approaches

One-Versus-All Approach (OVA)

[Figure: three classes C1, C2, C3, each separated from the other two by its own hyperplane H1, H2, H3]

- Train k distinct binary SVMs, each separating one class from the others
- One class is labeled +1, all others -1
- A test instance is assigned to the +1-labeled class of the SVM with the maximum output value
- k optimization problems, each with n decision variables
- Comparing output values across different SVMs may be problematic

All-Versus-All Approach (AVA)

[Figure: three classes C1, C2, C3 with pairwise separating hyperplanes H12, H13, H23]

- Train k(k-1)/2 distinct binary SVMs, one for each possible pair of classes
- A voting scheme is required for testing
- k(k-1)/2 optimization problems, each with roughly 2n/k decision variables (for a balanced dataset)
- Possible variance increase due to small training set sizes
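Both decompositions are easy to assemble from binary SVMs. The sketch below uses scikit-learn; the library, its class names, and the toy dataset are illustrative choices of this note, not something the slides prescribe.

    # Minimal sketch of the OVA and AVA decompositions with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)          # k = 3 classes
    svm = SVC(kernel='rbf', C=1.0)

    ova = OneVsRestClassifier(svm).fit(X, y)   # trains k machines: one class vs. the rest
    ava = OneVsOneClassifier(svm).fit(X, y)    # trains k(k-1)/2 machines, majority vote

    print(ova.predict(X[:5]))
    print(ava.predict(X[:5]))

For k = 3 both schemes happen to train three machines; the trade-off between problem count and per-problem size only becomes visible for larger k.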
Single Machine Multi-Class SVM

A more natural way is to consider all classes at once; the following SVM learns the k discriminants together.

Optimization problem:

  minimize_{w, b, \xi}   (1/2) \sum_{m=1}^{k} \|w_m\|^2 + C \sum_{i=1}^{n} \sum_{m \ne y_i} \xi_i^m
  subject to   \langle w_{y_i}, x_i \rangle + b_{y_i} \ge \langle w_m, x_i \rangle + b_m + 2 - \xi_i^m,   i = 1, \dots, n,   m \ne y_i
               \xi_i^m \ge 0,   i = 1, \dots, n,   m \ne y_i

Decision function:

  f(x) = \arg\max_m (\langle w_m, x \rangle + b_m)

For Further Reading

Vladimir N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
Jason Weston and Chris Watkins. Multi-Class Support Vector Machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.
Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.
Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141, 2004.
Eddy Mayoraz and Ethem Alpaydın. Support Vector Machines for Multi-Class Classification. Engineering Applications of Bio-Inspired Artificial Neural Networks, 833-842, 1999.