Within [[mathematical analysis]], '''Regularization perspectives on support-vector machines''' provide a way of interpreting [[support-vector machine]]s (SVMs) in the context of other regularization-based machine-learning algorithms. SVM algorithms categorize binary data, with the goal of fitting the [[training set]] data in a way that minimizes the average of the hinge-loss function and the L2 norm of the learned weights. This strategy avoids [[overfitting]] via [[Tikhonov regularization]] in the L2-norm sense and also corresponds to minimizing the bias and variance of the estimator of the weights. Estimators with lower [[mean squared error]] predict better, or generalize better, when given unseen data.
[[Regularization (mathematics)|Regularization]] algorithms also aim to fit training-set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set but is also not too complicated, where complicated functions are functions with high [[norm (mathematics)|norm]]s in some [[function space]]. Specifically, [[Tikhonov regularization]] algorithms choose a function that minimizes the sum of the training-set error plus the function's norm. The training-set error can be calculated with different [[loss function]]s. For example, [[regularized least squares]] is a special case of Tikhonov regularization using the [[squared error loss]] as the loss function.<ref name="rosasco1">{{cite web |last=Rosasco |first=Lorenzo |title=Regularized Least-Squares and Support Vector Machines |url=https://www.mit.edu/~9.520/spring12/slides/class06/class06_RLSSVM.pdf}}</ref>
In the classification setting, Tikhonov regularization algorithms produce a [[decision boundary]] that minimizes the average training-set error, while an L2-norm penalty on the weights keeps the decision boundary from becoming excessively complicated and overfitting the training data. Training- and test-set errors can be measured without bias and in a fair way using accuracy, precision, AUC-ROC, precision-recall, and other metrics.
Regularization perspectives on support-vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal |last=Cortes |first=Corinna |author2=Vladimir Vapnik |title=Support-Vector Networks |journal=Machine Learning |year=1995 |volume=20 |issue=3 |pages=273–297 |doi=10.1007/BF00994018 |doi-access=free}}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine learning]] techniques for avoiding overfitting, like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1"/><ref>{{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref><ref name="Lee 2012 67–81">{{cite journal |last1=Lee |first1=Yoonkyung |author1-link=Yoonkyung Lee |first2=Grace |last2=Wahba |author2-link=Grace Wahba |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |year=2012 |volume=99 |issue=465 |pages=67–81 |doi=10.1198/016214504000000098 |citeseerx=10.1.1.22.1879 |s2cid=261035640}}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal |author=Rosasco L. |author2=De Vito E. |author3=Caponnetto A. |author4=Piana M. |author5=Verri A. |title=Are Loss Functions All the Same |journal=Neural Computation |date=May 2004 |volume=16 |issue=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |pmid=15070510 |citeseerx=10.1.1.109.6786 |s2cid=11845688}}</ref>
==Theoretical background==
In the [[statistical learning theory]] framework, an [[algorithm]] is a strategy for choosing a [[function (mathematics)|function]] <math>f\colon \mathbf X \to \mathbf Y</math> given a training set <math>S = \{(x_1, y_1), \ldots, (x_n, y_n)\}</math> of inputs <math>x_i</math> and their labels <math>y_i</math> (the labels are usually <math>\pm1</math>). [[Regularization (mathematics)|Regularization]] strategies avoid [[overfitting]] by choosing a function that fits the data, but is not too complex. Specifically:
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i, f(x_i)) + \lambda\|f\|^2_\mathcal{H}\right\},</math>
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine-learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V\colon \mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>\|\cdot\|_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda \in \mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal |last=Wahba |first=Grace |author2=Yonghua Wang |title=When is the optimal regularization parameter insensitive to the choice of the loss function |journal=Communications in Statistics – Theory and Methods |year=1990 |volume=19 |issue=5 |pages=1685–1700 |doi=10.1080/03610929008830285}}</ref>
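As a concrete illustration of this objective (a minimal numerical sketch, not taken from the cited sources), the following Python code instantiates Tikhonov regularization for a linear hypothesis space with the squared-error loss, i.e. the regularized least-squares special case mentioned above; the data, the value of the regularization parameter, and names such as <code>reg_param</code> are arbitrary choices for illustration.

<syntaxhighlight lang="python">
# Minimal sketch of the Tikhonov objective above for a linear hypothesis space
# with the squared-error loss (regularized least squares). Data and reg_param
# are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                                      # inputs x_1, ..., x_n
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))   # labels +/- 1

reg_param = 0.1                                                   # regularization parameter lambda
# Setting the gradient of (1/n)*||X w - y||^2 + reg_param*||w||^2 to zero gives
#   (X^T X + n*reg_param*I) w = X^T y.
w = np.linalg.solve(X.T @ X + n * reg_param * np.eye(d), X.T @ y)
print("learned weights:", w)
</syntaxhighlight>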
When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K\colon \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n \times n</math> [[symmetric]] [[Positive-definite kernel|positive-definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref>See {{cite conference
| last1 = Schölkopf | first1 = Bernhard
| last2 = Herbrich | first2 = Ralf
| last3 = Smola | first3 = Alexander J.
| editor1-last = Helmbold | editor1-first = David P.
| editor2-last = Williamson | editor2-first = Robert C.
| contribution = A generalized representer theorem
| doi = 10.1007/3-540-44581-1_27
| pages = 416–426
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16–19, 2001, Proceedings
| volume = 2111
| year = 2001| isbn = 978-3-540-42343-0
}}</ref>
: <math>f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}, \text{ and } \|f\|^2_{\mathcal H} = \langle f, f\rangle_\mathcal H = \sum_{i=1}^n \sum_{j=1}^n c_i c_jK(x_i, x_j) = c^T \mathbf K c.</math>
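The following sketch (illustrative only, not taken from the sources above) evaluates these two expressions for an arbitrary coefficient vector <math>c</math>, using a Gaussian kernel purely as an example of a symmetric positive-definite kernel; the helper name <code>gaussian_kernel</code> and all numerical values are assumptions for the example.

<syntaxhighlight lang="python">
# Illustrative sketch of the representer-theorem parameterization:
#   f(x_i) = sum_j c_j K_ij   and   ||f||^2_H = c^T K c.
# The Gaussian kernel and the coefficient values are arbitrary example choices.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), a symmetric positive-definite kernel."""
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # training inputs x_1, ..., x_n
c = rng.normal(size=10)               # expansion coefficients c_j

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])  # kernel matrix K

f_on_train = K @ c                    # f(x_i) = sum_j c_j K_ij at each training point
rkhs_norm_sq = c @ K @ c              # ||f||^2_H = c^T K c
print(f_on_train[:3], rkhs_norm_sq)
</syntaxhighlight>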
==Special properties of the hinge loss==
[[File:Hinge and Misclassification Loss.png|thumb|Hinge and misclassification loss functions]]
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if <math>f(x_i) = y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e. the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, <math>V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+</math>, where <math>(s)_+ = \max(s, 0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0–1 misclassification loss function,<ref name="Lee 2012 67–81"/> and with infinite data returns the [[Bayes' theorem|Bayes]]-optimal solution:<ref name="Rosasco 2004 1063–1076"/><ref>{{cite journal |last=Lin |first=Yi |title=Support Vector Machines and the Bayes Rule in Classification |journal=Data Mining and Knowledge Discovery |date=July 2002 |volume=6 |issue=3 |pages=259–275 |doi=10.1023/A:1015469627679 |s2cid=24759201 |url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref>
: <math>f_b(x) = \begin{cases}
1, & p(1\mid x) > p(-1\mid x), \\
-1, & p(1\mid x) < p(-1\mid x).
\end{cases}</math>
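A short sketch (illustrative only, not from the cited sources) comparing the 0–1 loss with the hinge loss as functions of the margin <math>yf(x)</math>; the function names and margin values are arbitrary choices for the example.

<syntaxhighlight lang="python">
# Illustrative comparison of the 0-1 misclassification loss and the hinge loss,
# both written as functions of the label y and the prediction f(x).
import numpy as np

def zero_one_loss(y, fx):
    """Heaviside step on -y*f(x): 1 when the prediction has the wrong sign, else 0."""
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    """(1 - y*f(x))_+ = max(1 - y*f(x), 0), a convex upper bound on the 0-1 loss."""
    return np.maximum(1.0 - y * fx, 0.0)

margins = np.linspace(-2, 2, 9)                  # values of y*f(x)
ones = np.ones_like(margins)
print(zero_one_loss(ones, margins))              # steps from 1 to 0 at margin 0
print(hinge_loss(ones, margins))                 # decreases linearly, 0 once y*f(x) >= 1
</syntaxhighlight>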
==Derivation==
The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.<ref>For a detailed derivation, see {{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref> With the hinge loss
: <math>V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+,</math>
where <math>(s)_+ = \max(s, 0)</math>, the regularization problem becomes
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n \big(1 - yf(x)\big)_+ + \lambda\|f\|^2_\mathcal{H}\right\}.</math>
Multiplying by <math>1/(2\lambda)</math> yields
: <math>f = \underset{f\in\mathcal{H}}{\operatorname{argmin}}\left\{C\sum_{i=1}^n \big(1 - yf(x)\big)_+ + \frac{1}{2}\|f\|^2_\mathcal{H}\right\},</math>
with <math>C = 1/(2\lambda n)</math>, which is equivalent to the standard SVM minimization problem.
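This equivalence can also be checked numerically. The sketch below (illustrative only, assuming a linear hypothesis space without offset, randomly generated data, and an arbitrary <math>\lambda</math>) minimizes both objectives and recovers the same weights up to optimizer tolerance, since the second objective is just the first multiplied by <math>1/(2\lambda)</math>.

<syntaxhighlight lang="python">
# Illustrative numerical check that the two objectives above share a minimizer.
# A linear model without offset and randomly generated data are assumed;
# lam and the sample size are arbitrary.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0]))           # labels +/- 1

lam = 0.1
C = 1.0 / (2.0 * lam * n)                        # C = 1 / (2*lambda*n)

def tikhonov_objective(w):
    # (1/n) * sum_i (1 - y_i w.x_i)_+ + lambda * ||w||^2
    return np.mean(np.maximum(1 - y * (X @ w), 0)) + lam * w @ w

def svm_objective(w):
    # C * sum_i (1 - y_i w.x_i)_+ + (1/2) * ||w||^2  (the first objective times 1/(2*lambda))
    return C * np.sum(np.maximum(1 - y * (X @ w), 0)) + 0.5 * w @ w

w_tik = minimize(tikhonov_objective, np.zeros(d), method="Nelder-Mead").x
w_svm = minimize(svm_objective, np.zeros(d), method="Nelder-Mead").x
print(w_tik, w_svm)                              # agree up to optimizer tolerance
</syntaxhighlight>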
==Notes and references==
{{Reflist}}
* {{cite journal|last=Evgeniou|first=Theodoros|author2=Massimiliano Pontil|author3=Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=1–50|doi=10.1023/A:1018946025316|s2cid=70866|url=http://cbcl.mit.edu/projects/cbcl/publications/ps/evgeniou-reviewall.pdf}}
* {{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/|access-date=2012-05-18|archive-url=https://web.archive.org/web/20150419060655/http://svmlight.joachims.org/|archive-date=2015-04-19}}
* {{cite book|last=Vapnik|first=Vladimir|title=The Nature of Statistical Learning Theory|year=1999|publisher=Springer-Verlag|___location=New York|isbn=978-0-387-98780-4|url=https://books.google.com/books?id=sna9BaxVbj8C&q=vapnik+the+nature+of+statistical+learning+theory&pg=PR7}}
[[Category:Support vector machines]]
[[Category:Mathematical analysis]]