Revision as of 21:37, 3 January 2019 edit Mikhail Ryazanov (talk \| contribs) Extended confirmed users 24,668 edits m punct., fmt., style ← Previous edit		Revision as of 18:49, 4 January 2019 edit undo Michael Hardy (talk \| contribs) Administrators 210,597 edits No edit summary Next edit →
Line 7: In the [[statistical learning theory]] framework, an [[algorithm]] is a strategy for choosing a [[function (mathematics)\|function]] <math>f \colon \mathbf X \to \mathbf Y </math> given a training set <math>S = \{(x_1, y_1), \ldots, (x_n, y_n)\}</math> of inputs <math>x_i</math> and their labels <math>y_i</math> (the labels are usually <math>\pm1</math>). [[Regularization (mathematics)\|Regularization]] strategies avoid [[overfitting]] by choosing a function that fits the data, but is not too complex. Specifically: : <math>f = \~~operatorname{arg}\min_~~underset{f\in\mathcal{H}} \operatorname{argmin}\left\{\frac{ 1}{ n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda\\|f\\|^2_\mathcal{H}\right\},</math> where <math>\mathcal{H}</math> is a [[hypothesis space]]<ref>A hypothesis space is the set of functions used to model the data in a machine-learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a [[Hilbert space]] of functions with norm formed from the loss function.</ref> of functions, <math>V \colon \mathbf Y \times \mathbf Y \to \mathbb R</math> is the loss function, <math>\\|\cdot\\|_\mathcal H</math> is a [[norm (mathematics)\|norm]] on the hypothesis space of functions, and <math>\lambda \in \mathbb R</math> is the [[regularization parameter]].<ref>For insight on choosing the parameter, see, e.g., {{cite journal \|last=Wahba \|first=Grace \|author2=Yonghua Wang \|title=When is the optimal regularization parameter insensitive to the choice of the loss function \|journal=Communications in Statistics – Theory and Methods \|year=1990 \|volume=19 \|issue=5 \|pages=1685–1700 \|doi=10.1080/03610929008830285 \|url=http://www.tandfonline.com/doi/abs/10.1080/03610929008830285}}</ref> When <math>\mathcal{H}</math> is a [[reproducing kernel Hilbert space]], there exists a [[kernel function]] <math>K \colon \mathbf X \times \mathbf X \to \mathbb R</math> that can be written as an <math>n \times n</math> [[symmetric]] [[Positive-definite kernel\|positive-definite]] [[matrix (mathematics)\|matrix]] <math>\mathbf K</math>. By the [[representer theorem]],<ref>See {{cite journal \|last=Scholkopf \|first=Bernhard \|author2=Ralf Herbrich \|author3=Alex Smola \|title=A Generalized Representer Theorem \|journal=Computational Learning Theory: Lecture Notes in Computer Science \|year=2001 \|volume=2111 \|pages=416–426 \|doi=10.1007/3-540-44581-1_27 \|url=http://www.springerlink.com/content/v1tvba62hd4837h9/?MUD=MP}}</ref> : <math>f(x_i) = \sum_{f=1}^n c_j \mathbf K_{ij}~~</math>~~, \text{ and ~~<math>~~} \\|f\\|^2_{\mathcal H} = \langle f, f\rangle_\mathcal H = \sum_{i=1}^n \sum_{j=1}^n c_i c_jK(x_i, x_j) = c^T \mathbf K c.</math> ==Special properties of the hinge loss== Line 18 ⟶ 19: The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if <math>f(x_i) = y_i</math> and 1 if <math>f(x_i) \neq y_i</math>, i.e. the [[Heaviside step function]] on <math>-y_if(x_i)</math>. However, this loss function is not [[convex function\|convex]], which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, <math>V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+</math>, where <math>(s)_+ = \max(s, 0)</math>, provides such a [[convex relaxation]]. In fact, the hinge loss is the tightest convex [[upper bound]] to the 0–1 misclassification loss function,<ref name="Lee 2012 67–81"/> and with infinite data returns the [[Bayes' theorem\|Bayes]]-optimal solution:<ref name="Rosasco 2004 1063–1076"/><ref>{{cite journal \|last=Lin \|first=Yi \|title=Support Vector Machines and the Bayes Rule in Classification \|journal=Data Mining and Knowledge Discovery \|date=July 2002 \|volume=6 \|issue=3 \|pages=259–275 \|doi=10.1023/A:1015469627679 \|url=http://cbio.ensmp.fr/~jvert/svn/bibli/local/Lin2002Support.pdf}}</ref> : <math>f_b(x) = \begin{cases} 1, & p(1\|\mid x) > p(-1\|\mid x), \\ -1, & p(1\|\mid x) < p(-1\|\mid x). \end{cases}</math> Line 27 ⟶ 28: The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.<ref>For a detailed derivation, see {{cite book \|last=Rifkin \|first=Ryan \|title=Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning \|year=2002 \|publisher=MIT (PhD thesis) \|url=http://web.mit.edu/~9.520/www/Papers/thesis-rifkin.pdf}}</ref> With the hinge loss : <math>V\big(y_i, f(x_i)\big) = \big(1 - yf(x)\big)_+,</math> where <math>(s)_+ = \max(s, 0)</math>, the regularization problem becomes : <math>f = \~~operatorname{arg}\min_~~underset{f\in\mathcal{H}} \operatorname{argmin} \left\{\frac{ 1}{ n} \sum_{i=1}^n \big(1 - yf(x)\big)_+ + \lambda\\|f\\|^2_\mathcal{H}\right\}.</math> Multiplying by <math>1/(2\lambda)</math> yields : <math>f = \~~operatorname{arg}\min_~~underset{f\in\mathcal{H}} \operatorname{argmin} \left\{C\sum_{i=1}^n \big(1 - yf(x)\big)_+ + \frac{1}{2}\\|f\\|^2_\mathcal{H}\right\}</math> with <math>C = 1/(2\lambda n)</math>, which is equivalent to the standard SVM minimization problem.

Regularization perspectives on support vector machines: Difference between revisions