'''Regularization perspectives on support vector machines''' provide a way of interpreting [[support vector machine]]s (SVMs) in the context of other machine learning algorithms. SVM algorithms categorize [[multidimensional]] data, with the goal of fitting the [[training set]] data well while also avoiding [[overfitting]], so that the solution [[generalize]]s to new data points. [[Regularization (mathematics)|Regularization]] algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set but is also not too complicated, where complicated functions are functions with high [[norm (mathematics)|norm]]s in some [[function space]]. Specifically, [[Tikhonov regularization]] algorithms choose a function that minimizes the sum of the training set error and the function's norm. The training set error can be calculated with different [[loss function]]s. For example, [[regularized least squares]] is a special case of Tikhonov regularization using the [[squared error loss]] as the loss function.<ref name="rosasco1">{{cite web|last=Rosasco|first=Lorenzo|title=Regularized Least-Squares and Support Vector Machines|url=http://www.mit.edu/~9.520/spring12/slides/class06/class06_RLSSVM.pdf}}
</ref>
 
Regularization perspectives on support vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal |last=Cortes |first=Corinna |author2=Vladimir Vapnik |title=Support-Vector Networks |journal=Machine Learning |year=1995 |volume=20 |issue=3 |pages=273–297 |doi=10.1007/BF00994018 |doi-access=free }}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine learning]] techniques for avoiding overfitting, like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1"/><ref>{{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref><ref name="Lee 2012 67–81">{{cite journal |last1=Lee |first1=Yoonkyung |author1-link=Yoonkyung Lee |first2=Grace |last2=Wahba |author2-link=Grace Wahba |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |year=2004 |volume=99 |issue=465 |pages=67–81 |doi=10.1198/016214504000000098 |s2cid=261035640 |citeseerx=10.1.1.22.1879 }}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal |author=Rosasco L. |author2=De Vito E. |author3=Caponnetto A. |author4=Piana M. |author5=Verri A. |title=Are Loss Functions All the Same |journal=Neural Computation |date=May 2004 |volume=16 |issue=5 |series=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |pmid=15070510 |citeseerx=10.1.1.109.6786 |s2cid=11845688 }}</ref>
 
==Theoretical background==
In the [[statistical learning theory]] framework, an [[algorithm]] is a strategy for choosing a [[function (mathematics)|function]] <math>f \colon \mathbf X \to \mathbf Y</math> given a training set <math>S = \{(x_1, y_1), \ldots, (x_n, y_n)\}</math> of inputs, <math>x_i</math>, and their labels, <math>y_i</math> (the labels are usually <math>\pm1</math>). [[Regularization (mathematics)|Regularization]] strategies avoid [[overfitting]] by choosing a function that fits the data, but is not too complex. Specifically:
 
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}} \left\{\frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda\|f\|^2_\mathcal{H}\right\},</math>
 
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine-learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V \colon \mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>\|\cdot\|_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda \in \mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal |last=Wahba |first=Grace |author2=Yonghua Wang |title=When is the optimal regularization parameter insensitive to the choice of the loss function |journal=Communications in Statistics - Theory and Methods |year=1990 |volume=19 |issue=5 |pages=1685–1700 |doi=10.1080/03610929008830285 |url=http://www.tandfonline.com/doi/abs/10.1080/03610929008830285 }}</ref>
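
For concreteness, the following minimal sketch (in Python with NumPy; the function and variable names such as <code>X</code>, <code>y</code> and <code>lam</code> are illustrative assumptions, not taken from the sources above) carries out this minimization for the special case of the squared error loss and a linear hypothesis space, i.e. regularized least squares, where a closed-form solution exists:

<syntaxhighlight lang="python">
# Minimal sketch of Tikhonov regularization with the squared error loss
# (regularized least squares) for a linear hypothesis f(x) = w . x.
# Names such as X, y and lam are illustrative assumptions.
import numpy as np

def regularized_least_squares(X, y, lam):
    """Minimize (1/n) * sum_i (y_i - w . x_i)^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Toy usage: labels in {-1, +1}; predictions are taken as sign(w . x).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
w = regularized_least_squares(X, y, lam=0.1)
print("learned weights:", w)
</syntaxhighlight>

Larger values of the regularization parameter shrink the learned weights toward zero, trading training-set error for a smaller norm.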
 
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K \colon \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n \times n</math> [[symmetric]] [[Positive-definite kernel|positive-definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref>See {{cite conference |last1=Schölkopf |first1=Bernhard |last2=Herbrich |first2=Ralf |last3=Smola |first3=Alexander J. |editor1-last=Helmbold |editor1-first=David P. |editor2-last=Williamson |editor2-first=Robert C. |contribution=A generalized representer theorem |doi=10.1007/3-540-44581-1_27 |pages=416–426 |publisher=Springer |series=Lecture Notes in Computer Science |title=Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16–19, 2001, Proceedings |volume=2111 |year=2001 |isbn=978-3-540-42343-0}}</ref>
: <math>f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}, \text{ and } \|f\|^2_{\mathcal H} = \langle f, f\rangle_\mathcal H = \sum_{i=1}^n \sum_{j=1}^n c_i c_jK(x_i, x_j) = c^T \mathbf K c.</math>
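
As an illustration, the sketch below (Python with NumPy; the Gaussian kernel and the randomly drawn coefficient vector are illustrative assumptions) evaluates these two expressions for a small training set:

<syntaxhighlight lang="python">
# Sketch of the representer-theorem expansion: given expansion coefficients c,
# f is evaluated at the training points via the Gram matrix K, and its RKHS
# norm is c^T K c.  The Gaussian (RBF) kernel is an illustrative choice.
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # five training inputs x_1, ..., x_5
c = rng.normal(size=5)         # expansion coefficients c_1, ..., c_5

K = rbf_gram_matrix(X)
f_at_training_points = K @ c   # f(x_i) = sum_j c_j K_ij
rkhs_norm_squared = c @ K @ c  # ||f||^2_H = c^T K c
print(f_at_training_points, rkhs_norm_squared)
</syntaxhighlight>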
 
==Special properties of the hinge loss==
[[File:Hinge and Misclassification Loss.png|Hinge and misclassification loss functions]]
 
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if <math>f(x_i) = y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e. the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, <math> V\big(y_i, f(x_i)\big) = \big(1 - y_i f(x_i)\big)_+</math>, where <math>(s)_+ = \max(s, 0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0–1 misclassification loss function,<ref name="Lee 2012 67–81"/> and with infinite data returns the [[Bayes' theorem|Bayes]]-optimal solution:<ref name="Rosasco 2004 1063–1076"/><ref>{{cite journal |last=Lin |first=Yi |title=Support Vector Machines and the Bayes Rule in Classification |journal=Data Mining and Knowledge Discovery |date=July 2002 |volume=6 |issue=3 |pages=259–275 |doi=10.1023/A:1015469627679 |s2cid=24759201 |url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref>
 
: <math>f_b(x) = \begin{cases} 1, & p(1\mid x) > p(-1\mid x), \\ -1, & p(1\mid x) < p(-1\mid x). \end{cases}</math>
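
The difference between the two losses can be made concrete with a short sketch (Python with NumPy; the convention that a margin of exactly zero counts as a misclassification is an assumption) that tabulates both as functions of the margin <math>y f(x)</math>:

<syntaxhighlight lang="python">
# Tabulate the 0-1 (misclassification) loss and the hinge loss as functions
# of the margin y * f(x).  The hinge loss upper-bounds the 0-1 loss and is convex.
import numpy as np

def zero_one_loss(margin):
    """1 when the prediction disagrees with the label (margin <= 0 here), else 0."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """(1 - y f(x))_+ = max(1 - margin, 0)."""
    return np.maximum(1.0 - margin, 0.0)

margins = np.linspace(-2.0, 2.0, 9)
table = np.column_stack([margins, zero_one_loss(margins), hinge_loss(margins)])
print("margin, 0-1 loss, hinge loss")
print(table)
</syntaxhighlight>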
 
==Derivation==
 
The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.<ref>For a detailed derivation, see {{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref> With the hinge loss,
 
: <math> V\big(y_i, f(x_i)\big) = \big(1 - y_i f(x_i)\big)_+,</math>
 
where <math>(s)_+ = \max(s, 0)</math>, the regularization problem becomes
 
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}} \left\{\frac{1}{n} \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda\|f\|^2_\mathcal{H}\right\}.</math>
 
Multiplying by <math>1/(2\lambda)</math> yields
 
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}} \left\{C\sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \frac{1}{2}\|f\|^2_\mathcal{H}\right\},</math>
 
with <math>C = 1/(2\lambda n)</math>, which is equivalent to the standard SVM minimization problem.
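
A numerical sketch of this equivalence follows (Python with NumPy; the restriction to a linear kernel, so that <math>\|f\|^2_\mathcal{H} = \|w\|^2</math>, and the use of plain subgradient descent are simplifying assumptions, not part of the derivation above):

<syntaxhighlight lang="python">
# Numerical sketch of the equivalence between the Tikhonov form
#   (1/n) * sum_i (1 - y_i w.x_i)_+ + lam * ||w||^2
# and the standard SVM form
#   C * sum_i (1 - y_i w.x_i)_+ + (1/2) * ||w||^2,   with C = 1 / (2 * lam * n).
# Restricting to a linear kernel (f(x) = w . x, ||f||^2_H = ||w||^2) and using
# plain subgradient descent are simplifying assumptions.
import numpy as np

def train_tikhonov(X, y, lam, steps=20000, lr=0.01):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        active = (y * (X @ w) < 1).astype(float)  # examples with nonzero hinge loss
        grad = -(X * (y * active)[:, None]).mean(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

def train_svm_form(X, y, C, steps=20000, lr=0.01):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        active = (y * (X @ w) < 1).astype(float)
        grad = -C * (X * (y * active)[:, None]).sum(axis=0) + w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] - X[:, 1])
lam = 0.1
C = 1.0 / (2 * lam * len(X))
# The second objective is the first multiplied by 1/(2*lam), so scaling the
# step size by 2*lam makes the two subgradient iterations coincide.
print(train_tikhonov(X, y, lam))
print(train_svm_form(X, y, C, lr=0.01 * 2 * lam))  # essentially the same weights
</syntaxhighlight>

Because multiplying an objective by a positive constant does not change its minimizer, the two formulations recover the same decision function.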
==Notes==
{{Reflist}}
 
==References==
* {{cite journal|last=Evgeniou|first=Theodoros |author2=Massimiliano Pontil |author3=Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=1–50|doi=10.1023/A:1018946025316|s2cid=70866 |url=http://cbcl.mit.edu/projects/cbcl/publications/ps/evgeniou-reviewall.pdf}}
* {{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/|access-date=2012-05-18|archive-url=https://web.archive.org/web/20150419060655/http://svmlight.joachims.org/|archive-date=2015-04-19}}
* {{cite book|last=Vapnik|first=Vladimir|title=The Nature of Statistical Learning Theory|year=1999|publisher=Springer-Verlag|___location=New York|isbn=978-0-387-98780-4|url=https://books.google.com/books?id=sna9BaxVbj8C&pg=PR7}}
 
[[Category:Support vector machines]]