Regularization perspectives on support vector machines

Regularization perspectives on support vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] as the loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal|last=Cortes|first=Corinna|coauthors=Vladimir Vapnik|title=Support-Vector Networks|journal=Machine Learning|year=1995|volume=20|pages=273–297|doi=10.1007/BF00994018|url=http://www.springerlink.com/content/k238jx04hm87j80g/?MUD=MP}}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine learning]] techniques for avoiding overfitting, such as [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1"/><ref>{{cite book|last=Rifkin|first=Ryan|title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning|year=2002|publisher=MIT (PhD thesis)|url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}
</ref><ref>{{cite journal|last=Lee|first=Yoonkyung|coauthors=Grace Wahba|title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2004|volume=99|issue=465|pages=67–81|doi=10.1198/016214504000000098|url=http://www.tandfonline.com/doi/abs/10.1198/016214504000000098}}
</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and has provided theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref>{{cite journal|last=Rosasco|first=Lorenzo|coauthors=Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri|title=Are Loss Functions All the Same?|journal=Neural Computation|year=2004|month=May|volume=16|issue=5|pages=1063–1076|doi=10.1162/089976604773135104|url=http://www.mitpressjournals.org/doi/pdf/10.1162/089976604773135104}}
</ref>
 
<math>f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i))+\lambda||f||^2_\mathcal{H}\right\} </math>,
 
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with a norm induced by the inner product on the space.</ref> of functions, <math>V:\mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>||\cdot||_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda\in\mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal|last=Wahba|first=Grace|coauthors=Yonghua Wang|title=When is the optimal regularization parameter insensitive to the choice of the loss function|journal=Communications in Statistics - Theory and Methods|year=1990|volume=19|issue=5|pages=1685–1700|doi=10.1080/03610929008830285|url=http://www.tandfonline.com/doi/abs/10.1080/03610929008830285}}</ref>
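
The following is a minimal sketch of the objective above, assuming a NumPy-based implementation; the function names and the choice of the hinge loss for <math>V</math> are illustrative assumptions, not taken from the cited sources:

<syntaxhighlight lang="python">
import numpy as np

def hinge_loss(y, fx):
    # One possible choice of V: the hinge loss max(0, 1 - y*f(x)).
    return np.maximum(0.0, 1.0 - y * fx)

def regularized_risk(f, X, y, V, squared_norm, lam):
    """Tikhonov objective: (1/n) * sum_i V(y_i, f(x_i)) + lam * ||f||_H^2.

    f            -- callable returning the prediction f(x) for one input x
    V            -- loss function applied elementwise to (y_i, f(x_i))
    squared_norm -- the value of ||f||_H^2 for this candidate f
    lam          -- the regularization parameter lambda
    """
    fx = np.array([f(x) for x in X])
    return np.mean(V(y, fx)) + lam * squared_norm
</syntaxhighlight>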
 
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K: \mathbf X \times \mathbf X \to \mathbb R</math> whose evaluations at the training points, <math>\mathbf K_{ij} = K(x_i,x_j)</math>, form an <math>n\times n</math> [[symmetric matrix|symmetric]] [[Positive-definite kernel|positive definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref> See {{cite journal|last=Scholkopf|first=Bernhard|coauthors=Ralf Herbrich and Alex Smola|title=A Generalized Representer Theorem|journal=Computational Learning Theory: Lecture Notes in Computer Science|year=2001|volume=2111|pages=416–426|doi=10.1007/3-540-44581-1_27|url=http://www.springerlink.com/content/v1tvba62hd4837h9/?MUD=MP}}</ref> <math>f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}</math>, and <math> ||f||^2_{\mathcal H} = \langle f,f\rangle_\mathcal H = \sum_{i=1}^n\sum_{j=1}^n c_ic_jK(x_i,x_j) = c^T\mathbf K c</math>.
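
As an illustration of how the representer theorem reduces the problem to the <math>n</math> coefficients <math>c</math>, the sketch below builds a Gaussian-kernel Gram matrix (an assumed kernel choice) and solves the Tikhonov problem for the square loss, which has the closed form <math>c = (\mathbf K + \lambda n I)^{-1}y</math>; with the hinge loss there is no closed form, and the SVM solves the corresponding convex problem numerically. Function names are illustrative, not from the cited sources.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_gram_matrix(X, sigma=1.0):
    # Gram matrix K_ij = K(x_i, x_j) for a Gaussian (RBF) kernel; the kernel
    # choice and bandwidth sigma are assumptions made for this sketch.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def square_loss_coefficients(K, y, lam):
    # With the square loss V(y, f(x)) = (y - f(x))^2, minimizing
    # (1/n)*||y - K c||^2 + lam * c^T K c gives c = (K + lam*n*I)^(-1) y.
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(K_new, c):
    # f(x) = sum_j c_j K(x, x_j), as given by the representer theorem;
    # K_new holds the kernel values K(x, x_j) for the new points.
    return K_new @ c
</syntaxhighlight>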
 
==Special properties of the hinge loss==
 
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if <math>f(x_i)=y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e., the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, <math> V(y_i,f(x_i)) = (1-y_if(x_i))_+</math>, where <math>(s)_+ = \max(s,0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0–1 misclassification loss function,<ref>{{cite journal|last=Lee|first=Yoonkyung|coauthors=Grace Wahba|title=Multicategory Support Vector Machines|journal=Journal of the American Statistical Association|year=2004|volume=99|issue=465|pages=67–81|doi=10.1198/016214504000000098|url=http://www.tandfonline.com/doi/abs/10.1198/016214504000000098}}</ref> and with infinite data returns the [[Bayes' theorem|Bayes]] optimal solution:<ref>{{cite journal|last=Lin|first=Yi|title=Support Vector Machines and the Bayes Rule in Classification|journal=Data Mining and Knowledge Discovery|year=2002|month=July|volume=6|issue=3|pages=259–275|doi=10.1023/A:1015469627679|url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref><ref>{{cite journal|last=Rosasco|first=Lorenzo|coauthors=Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri|title=Are Loss Functions All the Same?|journal=Neural Computation|year=2004|month=May|volume=16|issue=5|pages=1063–1076|doi=10.1162/089976604773135104|url=http://www.mitpressjournals.org/doi/pdf/10.1162/089976604773135104}}</ref>
 
<math>f_b(x) = \begin{cases}1&p(1|x)>p(-1|x)\\-1&p(1|x)<p(-1|x)\end{cases}</math>
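
The following sketch, with illustrative names and a NumPy-based implementation that are not from the cited sources, compares the 0–1 loss with its hinge-loss relaxation and minimizes the regularized hinge objective in the representer coefficients by subgradient descent; practical SVM solvers instead work on the dual quadratic program, so this is a didactic sketch only:

<syntaxhighlight lang="python">
import numpy as np

def zero_one_loss(y, fx):
    # Misclassification (0-1) loss: the Heaviside step function of -y*f(x).
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    # Hinge loss (1 - y*f(x))_+, a convex upper bound on the 0-1 loss.
    return np.maximum(0.0, 1.0 - y * fx)

def svm_subgradient_descent(K, y, lam, steps=2000, lr=0.01):
    # Minimize (1/n) * sum_i (1 - y_i * (K c)_i)_+ + lam * c^T K c over c
    # by subgradient descent.  The step size and iteration count are
    # illustrative assumptions.
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(steps):
        margins = y * (K @ c)
        active = margins < 1.0                    # margin-violating points
        # Subgradient of the hinge term (using df(x_i)/dc = K[i, :]) plus
        # the gradient of the regularizer.
        grad = -(K[active].T @ y[active]) / n + 2.0 * lam * (K @ c)
        c -= lr * grad
    return c
</syntaxhighlight>

For labels <math>y\in\{-1,+1\}</math>, <code>hinge_loss(y, fx) >= zero_one_loss(y, fx)</code> holds pointwise, which is the upper-bound property noted above.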
{{Reflist}}
 
*{{cite journal|last=Evgeniou|first=Theodoros|coauthors=Massimiliano Pontil and Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=1–50|doi=10.1023/A:1018946025316|url=http://cbcl.mit.edu/projects/cbcl/publications/ps/evgeniou-reviewall.pdf}}
 
*{{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/}}