Regularization perspectives on support vector machines

Latest revision as of 06:07, 17 April 2025
Specifically, [[Tikhonov regularization]] algorithms produce a [[decision boundary]] that minimizes the average training-set error while constraining that boundary, through an L2-norm penalty on the weights, so that it is neither excessively complicated nor overfit to the training data. The training-set and test-set errors can then be measured with metrics such as accuracy, precision, AUC-ROC, and precision-recall. Regularization perspectives on support-vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the [[hinge loss]] as the loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goals: to [[generalize]] without [[overfitting]].

SVM was first proposed in 1995 by [[Corinna Cortes]] and [[Vladimir Vapnik]], and framed geometrically as a method for finding [[hyperplane]]s that can separate [[multidimensional]] data into two categories.<ref>{{cite journal |last=Cortes |first=Corinna |author2=Vladimir Vapnik |title=Support-Vector Networks |journal=Machine Learning |year=1995 |volume=20 |issue=3 |pages=273–297 |doi=10.1007/BF00994018 |doi-access=free }}</ref> This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other [[machine-learning]] techniques for avoiding overfitting, like [[regularization (mathematics)|regularization]], [[early stopping]], [[sparsity]] and [[Bayesian inference]].
However, once it was discovered that SVM is also a [[special case]] of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.<ref name="rosasco1">{{cite web |last=Rosasco |first=Lorenzo |title=Regularized Least-Squares and Support Vector Machines |url=https://www.mit.edu/~9.520/spring12/slides/class06/class06_RLSSVM.pdf}}</ref><ref>{{cite book |last=Rifkin |first=Ryan |title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning |year=2002 |publisher=MIT (PhD thesis) |url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref><ref name="Lee 2012 67–81">{{cite journal |last1=Lee |first1=Yoonkyung |author1-link=Yoonkyung Lee |first2=Grace |last2=Wahba |author2-link=Grace Wahba |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |year=2004 |volume=99 |issue=465 |pages=67–81 |doi=10.1198/016214504000000098 |s2cid=261035640 |citeseerx=10.1.1.22.1879 }}</ref> This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.<ref name="Rosasco 2004 1063–1076">{{cite journal |author=Rosasco L. |author2=De Vito E. |author3=Caponnetto A. |author4=Piana M. |author5=Verri A. |title=Are Loss Functions All the Same? |journal=Neural Computation |date=May 2004 |volume=16 |issue=5 |series=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |pmid=15070510 |citeseerx=10.1.1.109.6786 |s2cid=11845688 }}</ref>

==Theoretical background==
Given a training set of <math>n</math> inputs <math>x_i</math> with labels <math>y_i</math>, Tikhonov regularization chooses a function that fits the data without being too complex by minimizing the average loss plus a penalty on the norm of the function:
: <math>f = \underset{f \in \mathcal H}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|^2_\mathcal H \right\},</math>
where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine-learning problem. Each function corresponds to a hypothesis about the structure of the data.
Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V \colon \mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>\|\cdot\|_\mathcal H</math> is a [[norm (mathematics)|norm]] on the hypothesis space of functions, and <math>\lambda \in \mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal |last=Wahba |first=Grace |author2=Yonghua Wang |title=When is the optimal regularization parameter insensitive to the choice of the loss function |journal=Communications in Statistics – Theory and Methods |year=1990 |volume=19 |issue=5 |pages=1685–1700 |doi=10.1080/03610929008830285 }}</ref>

When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K \colon \mathbf X \times \mathbf X \to \mathbb R</math> whose values on the <math>n</math> training points can be written as an <math>n \times n</math> [[symmetric]] [[Positive-definite kernel|positive-definite]] [[matrix (mathematics)|matrix]] <math>\mathbf K</math> with entries <math>\mathbf K_{ij} = K(x_i, x_j)</math>. By the [[representer theorem]],<ref>{{cite conference | last1 = Schölkopf | first1 = Bernhard | last2 = Herbrich | first2 = Ralf | last3 = Smola | first3 = Alexander J. | editor1-last = Helmbold | editor1-first = David P. | editor2-last = Williamson | editor2-first = Robert C. | contribution = A generalized representer theorem | doi = 10.1007/3-540-44581-1_27 | pages = 416–426 | publisher = Springer | series = Lecture Notes in Computer Science | title = Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16–19, 2001, Proceedings | volume = 2111 | year = 2001 | isbn = 978-3-540-42343-0 }}</ref>
: <math>f(x_i) = \sum_{j=1}^n c_j \mathbf K_{ij}, \text{ and } \|f\|^2_{\mathcal H} = \langle f, f\rangle_\mathcal H = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf K c.</math>
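
The two identities from the representer theorem can be checked numerically. The following is a minimal sketch rather than a reference implementation: the Gaussian kernel, the sample points, the labels and the coefficient vector <code>c</code> are all illustrative assumptions, and the function names are hypothetical. The last function assumes the usual Tikhonov form of the objective (average loss plus <math>\lambda</math> times the squared norm) with the hinge loss as <math>V</math>.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; an assumed choice, used only for illustration."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(xs, kernel):
    """Kernel matrix with entries K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def predictions(K, c):
    """f(x_i) = sum_j c_j * K[i][j] for every training point."""
    return [sum(cj * kij for cj, kij in zip(c, row)) for row in K]

def rkhs_norm_sq(K, c):
    """Squared norm ||f||^2 = c^T K c."""
    return sum(ci * fi for ci, fi in zip(c, predictions(K, c)))

def hinge_objective(K, c, ys, lam):
    """Average hinge loss max(0, 1 - y_i f(x_i)) plus lam * c^T K c."""
    fs = predictions(K, c)
    avg_loss = sum(max(0.0, 1.0 - y * f) for y, f in zip(ys, fs)) / len(ys)
    return avg_loss + lam * rkhs_norm_sq(K, c)

# Hypothetical toy data: three 2-D points with labels in {+1, -1}.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
ys = [1, -1, 1]
c = [0.5, -0.5, 0.25]   # arbitrary coefficients, not a fitted solution

K = gram_matrix(xs, rbf_kernel)
print("f(x_i):", predictions(K, c))
print("squared norm:", rkhs_norm_sq(K, c))
print("objective:", hinge_objective(K, c, ys, lam=0.1))
```

Because the function is represented by <math>n</math> coefficients, both the predictions on the training points and the penalty term depend on the data only through <math>\mathbf K</math>, so minimizing the objective reduces to a finite-dimensional problem in <math>c</math>.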