Content deleted Content added
Mark viking (talk | contribs) added Category:Mathematical analysis using HotCat |
RjwilmsiBot (talk | contribs) m fixing page range dashes using AWB |
||
Line 3:
</ref>
Regularization perspectives on support vector machines interpret SVM as a special case Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal|last=Cortes|first=Corinna|coauthors=Vladimir Vapnik|title=Suppor-Vector Networks|journal=Machine Learning|year=1995|volume=20|pages=
</ref><ref>{{cite journal|last=Lee|first=Yoonkyung|coauthors=Grace Wahba|title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2012|volume=99|issue=465|pages=
</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref>{{cite journal|last=Rosasco|first=Lorenzo|coauthors=Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri|title=Are Loss Functions All the Same|journal=Neural Computation|year=2004|month=May|volume=16|series=5|pages=
</ref>
Line 13:
<math>f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i))+\lambda||f||^2_\mathcal{H}\right\} </math>,
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V:\mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>||\cdot||_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda\in\mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal|last=Wahba|first=Grace|coauthors=Yonghua Wang|title=When is the optimal regularization parameter insensitive to the choice of the loss function|journal=Communications in Statistics - Theory and Methods|year=1990|volume=19|issue=5|pages=
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K: \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n\times n</math> [[symmetric]] [[Positive-definite kernel|positive definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref> See {{cite journal|last=Scholkopf|first=Bernhard|coauthors=Ralf Herbrich and Alex Smola|title=A Generalized Representer Theorem|journal=Computational Learning Theory: Lecture Notes in Computer Science|year=2001|volume=2111|pages=
==Special properties of the hinge loss==
Line 21:
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if <math>f(x_i)=y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e the [[heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, <math> V(y_i,f(x_i)) = (1-yf(x))_+</math> where <math>(s)_+ = max(s,0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0-1 misclassification loss function,<ref>{{cite journal|last=Lee|first=Yoonkyung|coauthors=Grace Wahba|title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2012|volume=99|issue=465|pages=
<math>f_b(x) = \left\{\begin{matrix}1&p(1|x)>p(-1|x)\\-1&p(1|x)<p(-1|x)\end{matrix}\right.</math>
Line 39:
{{Reflist}}
*{{cite journal|last=Evgeniou|first=Theodoros|coauthors=Massimiliano Pontil and Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=
*{{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/}}
|