Regularization perspectives on support vector machines: Difference between revisions

Content deleted Content added
m punct., fmt., style
Line 1:
{{context|date=May 2012}}
'''Regularization perspectives on support -vector machines''' provide a way of interpreting [[support -vector machine]]s (SVMs) in the context of other machine -learning algorithms. SVM algorithms categorize [[multidimensional]] data, with the goal of fitting the [[training set]] data well, but also avoiding [[overfitting]], so that the solution [[generalize]]s to new data points. [[Regularization (mathematics)|Regularization]] algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set, but also is not too complicated, where complicated functions are functions with high [[norm (mathematics)|norm]]s in some [[function space]]. Specifically, [[Tikhonov regularization]] algorithms choose a function that minimizeminimizes the sum of training -set error plus the function's norm. The training -set error can be calculated with different [[loss function]]s. For example, [[regularized least squares]] is a special case of Tikhonov regularization using the [[squared error loss]] as the loss function.<ref name="rosasco1">{{cite web |last=Rosasco |first=Lorenzo |title=Regularized Least-Squares and Support Vector Machines |url=http://www.mit.edu/~9.520/spring12/slides/class06/class06_RLSSVM.pdf}}</ref>
</ref>
 
Regularization perspectives on support -vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal |last=Cortes |first=Corinna |author2=Vladimir Vapnik |title=Support-Vector Networks |journal=Machine Learning |year=1995 |volume=20 |pages=273–297 |doi=10.1007/BF00994018 |url=http://www.springerlink.com/content/k238jx04hm87j80g/?MUD=MP}}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine -learning]] techniques for avoiding overfitting, like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1"/><ref>{{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref><ref name="Lee 2012 67–81">{{cite journal |last=Lee |first=Yoonkyung |author1-link= Yoonkyung Lee |first2=Grace |last2=Wahba |author2-link=Grace Wahba |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |year=2012 |volume=99 |issue=465 |pages=67–81 |doi=10.1198/016214504000000098 |url=http://www.tandfonline.com/doi/abs/10.1198/016214504000000098}}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal |authors=Rosasco L., De Vito E., Caponnetto A., Piana M., Verri A. |title=Are Loss Functions All the Same |journal=Neural Computation |date=May 2004 |volume=16 |series=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |url=http://www.mitpressjournals.org/doi/pdf/10.1162/089976604773135104 |pmid=15070510}}</ref>
</ref><ref name="Lee 2012 67–81">{{cite journal|last=Lee|first=Yoonkyung|author1-link= Yoonkyung Lee |first2=Grace|last2=Wahba|author2-link=Grace Wahba |title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2012|volume=99|issue=465|pages=67–81|doi=10.1198/016214504000000098|url=http://www.tandfonline.com/doi/abs/10.1198/016214504000000098}}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal|vauthors=Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A |title=Are Loss Functions All the Same|journal=Neural Computation|date=May 2004|volume=16|series=5|pages=1063–1076|doi=10.1162/089976604773135104|url=http://www.mitpressjournals.org/doi/pdf/10.1162/089976604773135104|pmid=15070510}}</ref>
 
==Theoretical background==
In the [[statistical learning theory]] framework, an [[algorithm]] is a strategy for choosing a [[function (mathematics)|function]] <math> f: \colon \mathbf X \to \mathbf Y </math> given a training set <math> S = \{(x_1, y_1), \ldots, (x_n, y_n)\}</math> of inputs, <math>x_i</math>, and their labels, <math>y_i</math> (the labels are usually <math>\pm1</math>). [[Regularization (mathematics)|Regularization]] strategies avoid [[overfitting]] by choosing a function that fits the data, but is not too complex. Specifically:
 
<math>f = \textoperatorname{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i, f(x_i)) + \lambda|\|f|\|^2_\mathcal{H}\right\} ,</math>,
 
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine -learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V: \colon \mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>|\|\cdot|\|_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda \in \mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal |last=Wahba |first=Grace |author2=Yonghua Wang |title=When is the optimal regularization parameter insensitive to the choice of the loss function |journal=Communications in Statistics - Theory and Methods |year=1990 |volume=19 |issue=5 |pages=1685–1700 |doi=10.1080/03610929008830285 |url=http://www.tandfonline.com/doi/abs/10.1080/03610929008830285}}</ref>
 
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K: \colon \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n \times n</math> [[symmetric]] [[Positive-definite kernel|positive -definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref>See {{cite journal |last=Scholkopf |first=Bernhard |author2=Ralf Herbrich |author3=Alex Smola |title=A Generalized Representer Theorem |journal=Computational Learning Theory: Lecture Notes in Computer Science |year=2001 |volume=2111 |pages=416–426 |doi=10.1007/3-540-44581-1_27 |url=http://www.springerlink.com/content/v1tvba62hd4837h9/?MUD=MP}}</ref> <math>f(x_i) = \sum_{f=1}^n c_j \mathbf K_{ij}</math>, and <math> |\|f|\|^2_{\mathcal H} = \langle f, f\rangle_\mathcal H = \sum_{i=1}^n \sum_{j=1}^n c_ic_jKc_i c_jK(x_i, x_j) = c^T \mathbf K c .</math>
 
==Special properties of the hinge loss==
[[File:Hinge and Misclassification Loss.png|Hinge and misclassification loss functions]]
 
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-10–1 loss, which is 0 if <math>f(x_i) = y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e. the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-10–1 loss. The hinge loss, <math> V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+</math>, where <math>(s)_+ = \max(s, 0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0-10–1 misclassification loss function,<ref name="Lee 2012 67–81"/> and with infinite data returns the [[Bayes' theorem|Bayes]] -optimal solution:<ref name="Rosasco 2004 1063–1076"/><ref>{{cite journal |last=Lin |first=Yi |title=Support Vector Machines and the Bayes Rule in Classification |journal=Data Mining and Knowledge Discovery |date=July 2002 |volume=6 |issue=3 |pages=259–275 |doi=10.1023/A:1015469627679 |url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref>
 
<math>f_b(x) = \left\{\begin{matrixcases}
1, & p(1|x) > p(-1|x), \\
-1, & p(1|x) < p(-1|x).
\end{matrixcases}\right.</math>
 
==Derivation==
 
The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.<ref>For a detailed derivation, see {{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref> With the hinge loss,
 
<math> V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+,</math>
 
where <math>(s)_+ = \max(s, 0)</math>, the regularization problem becomes
 
<math>f = \textoperatorname{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n \big(1 - yf(x)\big)_+ + \lambda|\|f|\|^2_\mathcal{H}\right\} .</math>.
 
Multiplying by <math>1/(2\lambda)</math> yields
 
<math>f = \textoperatorname{arg}\min_{f\in\mathcal{H}}\left\{C\sum_{i=1}^n \big(1 - yf(x)\big)_+ + \frac{1}{2}|\|f|\|^2_\mathcal{H}\right\} </math>,
 
with <math>C = 1/(2\lambda n)</math>, which is equivalent to the standard SVM minimization problem.
Line 41 ⟶ 42:
{{Reflist}}
 
* {{cite journal|last=Evgeniou|first=Theodoros |author2=Massimiliano Pontil |author3=Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=1–50|doi=10.1023/A:1018946025316|url=http://cbcl.mit.edu/projects/cbcl/publications/ps/evgeniou-reviewall.pdf}}
* {{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/}}
* {{cite book|last=Vapnik|first=Vladimir|title=The Nature of Statistical Learning Theory|year=1999|publisher=Springer-Verlag|___location=New York|isbn=0-387-98780-0|url=https://books.google.com/books?hl=en&lr=&id=sna9BaxVbj8C&oi=fnd&pg=PR7&dq=vapnik+the+nature+of+statistical+learning+theory&ots=onJeJ-it9b&sig=5g3uQT1umnkJKqcPaKUqpi10DMQ#v=onepage&q=vapnik%20the%20nature%20of%20statistical%20learning%20theory&f=false}}
 
[[Category:Support vector machines]]