{{short description|Algorithm for solving the quadratic programming problem from training SVMs}}
{{Infobox Algorithm
|image=
|class=[[Optimization algorithm]] for training support vector machines
|data=
|time=O(''n''³)
|space=
}}
'''Sequential minimal optimization''' ('''SMO''') is an algorithm for solving the [[quadratic programming]] (QP) problem that arises during the training of [[support-vector machine]]s (SVMs). It was invented by [[John Platt (computer scientist)|John Platt]] in 1998 at [[Microsoft Research]].<ref name="Platt">{{Cite web
| last = Platt | first = John
| year = 1998
| title = Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines
| citeseerx = 10.1.1.43.4376
| url = https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-98-14.pdf
}}</ref> SMO is widely used for training support vector machines and is implemented by the popular [[LIBSVM]] tool.<ref>{{cite journal
 
|last1=Chang |first1=Chih-Chung
|last2=Lin |first2=Chih-Jen
|title=LIBSVM: A library for support vector machines
|journal=ACM Transactions on Intelligent Systems and Technology
|volume=2 |issue=3 |year=2011
|doi=10.1145/1961189.1961199
|s2cid=961425
}}</ref><ref>{{cite web |first=Luca |last=Zanni |date=2006 |url=http://jmlr.csail.mit.edu/papers/volume7/zanni06a/zanni06a.pdf |title=Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems }}</ref> The publication of the SMO algorithm in 1998 generated a lot of excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party [[Quadratic programming|QP]] solvers.<ref>{{cite thesis
| last = Rifkin | first = Ryan
| year = 2002
| hdl = 1721.1/17549
| title = Everything Old is New Again: a Fresh Look at Historical Approaches in Machine Learning
| type = Ph.D. Thesis |publisher=Massachusetts Institute of Technology
| pages = 18
}}</ref>
 
== Optimization problem ==
{{main|Support vector machine#Formalization}}
Consider a [[binary classification]] problem with a dataset (''x''<sub>1</sub>, ''y''<sub>1</sub>), ..., (''x''<sub>''n''</sub>, ''y''<sub>''n''</sub>), where ''x''<sub>''i''</sub> is an input vector and {{nobr|''y''<sub>''i''</sub> ∈ {-1, +1} }} is a binary label corresponding to it. A soft-margin [[support vector machine]] is trained by solving a quadratic programming problem, which is expressed in the [[Dual problem|dual form]] as follows:
 
{{framebox|blue}}
:<math>\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i, x_j) \alpha_i \alpha_j,</math>
:subject to:
::<math>0 \leq \alpha_i \leq C, \quad \mbox{ for } i=1, 2, \ldots, n,</math>
::<math>\sum_{i=1}^n y_i \alpha_i = 0,</math>
{{frame-footer}}
 
where ''C'' is an SVM hyperparameter and ''K''(''x''<sub>''i''</sub>, ''x''<sub>''j''</sub>) is the [[kernel function]], both supplied by the user; and the variables <math>\alpha_i</math> are [[Lagrange multiplier]]s.
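As a concrete illustration (an editorial sketch, not taken from Platt's paper; the function names and the choice of a Gaussian kernel are assumptions), the dual objective and its two constraints can be evaluated directly:

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dual_objective(alpha, X, y, kernel):
    """Value of the SVM dual objective for the multipliers alpha."""
    n = len(alpha)
    linear = sum(alpha)
    quadratic = sum(
        y[i] * y[j] * kernel(X[i], X[j]) * alpha[i] * alpha[j]
        for i in range(n) for j in range(n)
    )
    return linear - 0.5 * quadratic

def is_feasible(alpha, y, C, tol=1e-9):
    """Check the box constraint 0 <= alpha_i <= C and the equality sum_i y_i alpha_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(yi * ai for yi, ai in zip(y, alpha))) <= tol
    return in_box and balanced

# A tiny two-point dataset: alpha = (0.5, 0.5) satisfies both constraints.
X = [(0.0, 0.0), (1.0, 1.0)]
y = [+1, -1]
alpha = [0.5, 0.5]
print(is_feasible(alpha, y, C=1.0))            # True
print(dual_objective(alpha, X, y, rbf_kernel))
```

Maximizing this objective over the feasible region defined by the two constraints is exactly the problem that SMO addresses.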
 
 
== Algorithm ==
SMO is an iterative algorithm for solving the optimization problem described above. SMO breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers <math>\alpha_i</math>, the smallest possible problem involves two such multipliers. Then, for any two multipliers <math>\alpha_1</math> and <math>\alpha_2</math>, the constraints are reduced to:
 
:<math>0 \leq \alpha_1, \alpha_2 \leq C,</math>
:<math>y_1 \alpha_1 + y_2 \alpha_2 = k,</math>
 
and this reduced problem can be solved analytically: one needs to find the minimum of a one-dimensional quadratic function. Here <math>k</math> is the negative of the sum over the remaining terms in the equality constraint, which is fixed in each iteration.
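The analytic solution of this two-multiplier sub-problem can be sketched as follows (a simplified rendering of the update in Platt's paper; the function name is illustrative, and the degenerate case <math>\eta \le 0</math>, which Platt handles separately, is simply skipped here). Writing ''E''<sub>''i''</sub> for the prediction error ''f''(''x''<sub>''i''</sub>) − ''y''<sub>''i''</sub> and ''K''<sub>''ij''</sub> for kernel values:

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """Analytically optimize one pair of Lagrange multipliers (a1, a2).

    Returns the updated pair, clipped so that both box constraints and
    the linear equality constraint y1*a1 + y2*a2 = const remain satisfied.
    """
    # Endpoints [L, H] of the segment on which the pair stays feasible.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    if L >= H:
        return a1, a2
    eta = K11 + K22 - 2.0 * K12   # curvature of the 1-D quadratic
    if eta <= 0:
        return a1, a2             # degenerate case, handled separately by Platt
    a2_new = a2 + y2 * (E1 - E2) / eta   # unconstrained 1-D optimum
    a2_new = min(H, max(L, a2_new))      # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)  # restore the equality constraint
    return a1_new, a2_new
```

For example, with <math>y_1 \ne y_2</math> the difference <math>\alpha_2 - \alpha_1</math> stays constant along the feasible segment, which is what the first branch of the bounds computation encodes.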
 
The algorithm proceeds as follows:
 
# Find a Lagrange multiplier <math>\alpha_1</math> that violates the [[Karush–Kuhn–Tucker conditions|Karush–Kuhn–Tucker (KKT) conditions]] for the optimization problem.
# Pick a second multiplier <math>\alpha_2</math> and optimize the pair <math>(\alpha_1,\alpha_2)</math>.
# Repeat steps 1 and 2 until convergence.
 
When all the Lagrange multipliers satisfy the KKT conditions (within a user-defined tolerance), the problem has been solved. Although this algorithm is guaranteed to converge, heuristics are used to choose the pair of multipliers so as to accelerate the rate of convergence. This is critical for large data sets since there are <math>n(n-1)/2</math> possible choices for <math>\alpha_i</math> and <math>\alpha_j</math>.
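Step 1 above reduces to a per-multiplier test. A minimal sketch of that test, following the form of the check in Platt's paper (the function name and default tolerance are illustrative):

```python
def violates_kkt(alpha_i, r_i, C, tol=1e-3):
    """Return True if one Lagrange multiplier violates the KKT conditions.

    Here r_i = y_i * f(x_i) - 1, with f the current decision function:
      alpha_i == 0      requires r_i >= 0,
      0 < alpha_i < C   requires r_i == 0,
      alpha_i == C      requires r_i <= 0.
    """
    return (r_i < -tol and alpha_i < C) or (r_i > tol and alpha_i > 0)

# A multiplier at the upper bound C whose example is classified with
# margin greater than 1 (r_i > 0) violates KKT:
print(violates_kkt(1.0, 0.5, C=1.0))   # True
```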
 
== Related work ==
The first approach to splitting large SVM learning problems into a series of smaller optimization tasks was proposed by [[Bernhard Boser]], [[Isabelle Guyon]], and [[Vladimir Vapnik]].<ref name="ReferenceA">{{Cite book | doi = 10.1145/130385.130401| chapter = A training algorithm for optimal margin classifiers| title = Proceedings of the fifth annual workshop on Computational learning theory - COLT '92| pages = 144| year = 1992| last1 = Boser | first1 = B. E. | last2 = Guyon | first2 = I. M. | last3 = Vapnik | first3 = V. N. | isbn = 978-0897914970| citeseerx = 10.1.1.21.3818| s2cid = 207165665}}</ref> It is known as the "chunking algorithm". The algorithm starts with a random subset of the data, solves this problem, and iteratively adds examples which violate the optimality conditions. One disadvantage of this algorithm is that it must solve QP problems whose size scales with the number of support vectors. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.<ref name = "Platt"/>
 
In 1997, [[E. Osuna]], [[R. Freund]], and [[F. Girosi]] proved a theorem which suggests a whole new set of QP algorithms for SVMs.<ref>{{Cite book | doi = 10.1109/NNSP.1997.622408| chapter = An improved training algorithm for support vector machines| title = Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop| pages = 276–285| year = 1997| last1 = Osuna | first1 = E. | last2 = Freund | first2 = R. | last3 = Girosi | first3 = F. | isbn = 978-0-7803-4256-9| citeseerx = 10.1.1.392.7405| s2cid = 5667586}}</ref> By virtue of this theorem a large QP problem can be broken down into a series of smaller QP sub-problems. A sequence of QP sub-problems that always add at least one violator of the [[Karush–Kuhn–Tucker conditions|Karush–Kuhn–Tucker (KKT) conditions]] is guaranteed to converge. The chunking algorithm obeys the conditions of the theorem, and hence will converge.<ref name = "Platt"/> The SMO algorithm can be considered a special case of the Osuna algorithm, where the size of the optimization is two and both Lagrange multipliers are replaced at every step with new multipliers that are chosen via good heuristics.<ref name = "Platt"/>
 
The SMO algorithm is closely related to a family of optimization algorithms called [[Bregman method]]s or row-action methods. These methods solve convex programming problems with linear constraints. They are iterative methods where each step projects the current primal point onto each constraint.<ref name = "Platt"/>
 
== See also ==
* [[Kernel perceptron]]
 
== References ==
{{reflist|30em}}
{{Optimization algorithms}}
[[Category:Optimization algorithms and methods]]
[[Category:Support vector machines]]