Specifically, [[Tikhonov regularization]] algorithms choose a [[decision boundary]] that minimizes the average training-set error while constraining the boundary not to be excessively complicated or to overfit the training data, via an L2-norm penalty on the weights. The training- and test-set errors can be measured without bias and in a fair way using accuracy, precision, AUC-ROC, precision-recall, and other metrics.
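Schematically, such a regularization problem seeks, among candidate functions <math>f</math>, one that balances the average training loss against a squared-norm penalty. In generic notation (a sketch, with <math>V</math> the loss function, <math>\lambda \ge 0</math> the regularization parameter, <math>\mathcal{H}</math> the hypothesis space, typically a [[reproducing kernel Hilbert space]], and the sum running over the <math>n</math> training pairs <math>(x_i, y_i)</math>), the problem reads
: <math>\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n V\big(y_i, f(x_i)\big) + \lambda \|f\|_{\mathcal{H}}^2.</math>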
Regularization perspectives on support-vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] for a loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]]. SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal |last=Cortes |first=Corinna |author2=Vladimir Vapnik |title=Support-Vector Networks |journal=Machine Learning |year=1995 |volume=20 |issue=3 |pages=273–297 |doi=10.1007/BF00994018 |doi-access=free }}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine-learning]] techniques for avoiding overfitting, like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]]. However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1">{{cite web |last=Rosasco |first=Lorenzo |title=Regularized Least-Squares and Support Vector Machines |url=https://www.mit.edu/~9.520/spring12/slides/class06/class06_RLSSVM.pdf}}</ref><ref>{{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref><ref name="Lee 2012 67–81">{{cite journal |last1=Lee |first1=Yoonkyung |author1-link= Yoonkyung Lee |first2=Grace |last2=Wahba |author2-link=Grace Wahba |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |year=2012 |volume=99 |issue=465 |pages=67–81 |doi=10.1198/016214504000000098 |s2cid=261035640 }}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal |author=Rosasco L. |author2=De Vito E. |author3=Caponnetto A. |author4=Piana M. |author5=Verri A. |title=Are Loss Functions All the Same |journal=Neural Computation |date=May 2004 |volume=16 |issue=5 |series=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |pmid=15070510|citeseerx=10.1.1.109.6786 |s2cid=11845688 }}</ref>
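Under this interpretation, the soft-margin SVM is (as a sketch, up to reparameterization of its conventional penalty constant, often denoted <math>C</math>) the special case of the display above obtained by taking <math>V</math> to be the hinge loss <math>\big(1 - y_i f(x_i)\big)_+</math>:
: <math>\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda \|f\|_{\mathcal{H}}^2, \qquad \lambda = \frac{1}{2Cn}.</math>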
==Theoretical background==
[[File:Hinge and Misclassification Loss.png|Hinge and misclassification loss functions]]
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if <math>f(x_i) = y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e. the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, <math>V\big(y_i, f(x_i)\big) = \big(1 - y_i f(x_i)\big)_+</math>, where <math>(s)_+ = \max(s, 0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0–1 misclassification loss function,<ref name="Lee 2012 67–81"/> and with infinite data returns the [[Bayes' theorem|Bayes]]-optimal solution:<ref name="Rosasco 2004 1063–1076"/><ref>{{cite journal |last=Lin |first=Yi |title=Support Vector Machines and the Bayes Rule in Classification |journal=Data Mining and Knowledge Discovery |date=July 2002 |volume=6 |issue=3 |pages=259–275 |doi=10.1023/A:1015469627679 |s2cid=24759201 |url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref>
: <math>f_b(x) = \begin{cases}
1, & \text{if } p(1 \mid x) > p(-1 \mid x), \\
-1, & \text{if } p(1 \mid x) < p(-1 \mid x),
\end{cases}</math>
where <math>p(y \mid x)</math> denotes the conditional probability of the label <math>y \in \{-1, 1\}</math> at the point <math>x</math>.
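The convex-relaxation idea can also be illustrated numerically. The following is a minimal, self-contained sketch (the toy data, function names, and step sizes are illustrative choices, not part of any standard SVM library): a linear classifier is fitted by subgradient descent on the regularized hinge objective, and its training performance is then read off with the misclassification loss.

<syntaxhighlight lang="python">
import numpy as np

def hinge_loss(y, f):
    """Hinge loss (1 - y f)_+, the convex surrogate for the 0-1 loss."""
    return np.maximum(0.0, 1.0 - y * f)

def zero_one_loss(y, f):
    """Misclassification (0-1) loss: 1 when sign(f) disagrees with y."""
    return (y * f <= 0).astype(float)

def fit_linear_svm(X, y, lam=0.1, lr=0.01, epochs=200):
    """Subgradient descent on (1/n) * sum_i hinge(y_i, w.x_i) + lam * ||w||^2
    (linear kernel, no bias term, for brevity)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1  # points whose hinge loss has a nonzero subgradient
        grad = -(X[active] * y[active, None]).sum(axis=0) / n + 2.0 * lam * w
        w -= lr * grad
    return w

# Toy two-cluster data: positives near (2, 2), negatives near (-2, -2).
rng = np.random.default_rng(0)
labels = rng.choice([-1.0, 1.0], size=(200, 1))
X = rng.normal(size=(200, 2)) + 2.0 * labels
y = labels.ravel()

w = fit_linear_svm(X, y)
print("mean hinge loss:   ", hinge_loss(y, X @ w).mean())
print("training 0-1 error:", zero_one_loss(y, X @ w).mean())
</syntaxhighlight>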
{{Reflist}}
* {{cite journal|last=Evgeniou|first=Theodoros |author2=Massimiliano Pontil |author3=Tomaso Poggio|title=Regularization Networks and Support Vector Machines|journal=Advances in Computational Mathematics|year=2000|volume=13|issue=1|pages=1–50|doi=10.1023/A:1018946025316|s2cid=70866 |url=http://cbcl.mit.edu/projects/cbcl/publications/ps/evgeniou-reviewall.pdf}}
* {{cite web|last=Joachims|first=Thorsten|title=SVMlight|url=http://svmlight.joachims.org/|access-date=2012-05-18|archive-url=https://web.archive.org/web/20150419060655/http://svmlight.joachims.org/|archive-date=2015-04-19}}
* {{cite book|last=Vapnik|first=Vladimir|title=The Nature of Statistical Learning Theory|year=1999|publisher=Springer-Verlag|___location=New York|isbn=978-0-387-98780-4|url=https://books.google.com/books?id=sna9BaxVbj8C&q=vapnik+the+nature+of+statistical+learning+theory&pg=PR7}}