Regularization perspectives on support vector machines: Difference between revisions

Content deleted Content added
Monkbot (talk | contribs)
Line 3:
</ref>
 
Regularization perspectives on support vector machines interpret SVM as a special case Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal|last=Cortes|first=Corinna|coauthorsauthor2=Vladimir Vapnik |title=Suppor-Vector Networks|journal=Machine Learning|year=1995|volume=20|pages=273–297|doi=10.1007/BF00994018|url=http://www.springerlink.com/content/k238jx04hm87j80g/?MUD=MP}}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine learning]] techniques for avoiding overfitting like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1"/><ref>{{cite book|last=Rifkin|first=Ryan|title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning|year=2002|publisher=MIT (PhD thesis)|url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}
</ref><ref name="Lee 2012 67–81">{{cite journal|last=Lee|first=Yoonkyung|coauthorsauthor2=Grace Wahba |title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2012|volume=99|issue=465|pages=67–81|doi=10.1198/016214504000000098|url=http://www.tandfonline.com/doi/abs/10.1198/016214504000000098}}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal|last=Rosasco|first=Lorenzo|coauthors=Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri|title=Are Loss Functions All the Same|journal=Neural Computation|date=May 2004|volume=16|series=5|pages=1063–1076|doi=10.1162/089976604773135104|url=http://www.mitpressjournals.org/doi/pdf/10.1162/089976604773135104|pmid=15070510}}</ref>
 
==Theoretical background==
Line 11:
<math>f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i))+\lambda||f||^2_\mathcal{H}\right\} </math>,
 
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V:\mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>||\cdot||_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda\in\mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal|last=Wahba|first=Grace|coauthorsauthor2=Yonghua Wang |title=When is the optimal regularization parameter insensitive to the choice of the loss function|journal=Communications in Statistics - Theory and Methods|year=1990|volume=19|issue=5|pages=1685–1700|doi=10.1080/03610929008830285|url=http://www.tandfonline.com/doi/abs/10.1080/03610929008830285}}</ref>
 
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K: \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n\times n</math> [[symmetric]] [[Positive-definite kernel|positive definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref>See {{cite journal|last=Scholkopf|first=Bernhard|coauthors=Ralf Herbrich and Alex Smola|title=A Generalized Representer Theorem|journal=Computational Learning Theory: Lecture Notes in Computer Science|year=2001|volume=2111|pages=416–426|doi=10.1007/3-540-44581-1_27|url=http://www.springerlink.com/content/v1tvba62hd4837h9/?MUD=MP}}</ref> <math>f(x_i) = \sum_{f=1}^n c_j \mathbf K_{ij}</math>, and <math> ||f||^2_{\mathcal H} = \langle f,f\rangle_\mathcal H = \sum_{i=1}^n\sum_{j=1}^n c_ic_jK(x_i,x_j) = c^T\mathbf K c </math>