[[File:Loss function surrogates.svg|thumb|Plot of various functions. Blue is the 0–1 indicator function. Green is the square loss function. Purple is the hinge loss function. Yellow is the logistic loss function. Note that all surrogates give a loss penalty of 1 for {{math|''y'' ''f''(''x''){{=}}0}}]]
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| pmc = | citeseerx = 10.1.1.109.6786 }}</ref> Given <math>X</math> as the space of all possible inputs, and ''Y'' = {–1,1} as the set of all possible outputs, we wish to find a function <math>f: X \to \mathbb{R}</math> which best maps <math>\vec{x}</math> to <math>y</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | accessdate= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last= Rosasco | first= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected risk, defined as
:<math>I[f] = \displaystyle \int_{X \times Y} V(f(\vec{x}),y) p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is the loss function, and the probability density can be decomposed as
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
:<math>S = \{(\vec{x}_1,y_1),\dots,(\vec{x}_n,y_n)\}</math>
drawn from the data [[sample space]], one seeks to [[empirical risk minimization|minimize empirical risk]]
:<math>I_S[f] = \frac{1}{n} \sum_{i=1}^n V( f(\vec{x}_i),y_i)</math>
as a proxy for expected risk.<ref name="mitlec" /> (See [[statistical learning theory]] for a more detailed description.)

For computational ease, it is standard practice to write [[loss functions]] as functions of only one variable <math>\upsilon=y f(\vec{x})</math>. Within classification, loss functions are generally written solely in terms of the product of the true label <math>y</math> and the predicted value <math>f(\vec{x})</math>:
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))=\phi(\upsilon).</math>
Selection of a loss function within this framework impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk. Loss functions of this form are known as ''margin losses''.

Given the properties of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
:<math>
\begin{align}
I[f] & = \int_{X \times Y} V(f(\vec{x}),y) p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_X \int_Y V(-yf(\vec{x})) p(y\mid\vec{x})p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_X [V(-f(\vec{x})) p(1\mid\vec{x})+V(f(\vec{x})) p(-1\mid\vec{x})]p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_X [V(-f(\vec{x})) p(1\mid\vec{x})+V(f(\vec{x})) (1-p(1\mid\vec{x}))]p(\vec{x})\,d\vec{x}
\end{align}
</math>
The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for <math>y</math>, and the fourth because <math>p(-1\mid\vec{x})=1-p(1\mid\vec{x})</math>.
As a result, one can solve for the minimizers of <math>I[f]</math> for any convex loss function with these properties by differentiating the last equality with respect to <math>f</math> and setting the derivative equal to 0. Thus, minimizers for all of the loss function surrogates described below are easily obtained as functions of only <math>f(\vec{x})</math> and <math>p(1\mid\vec{x})</math>.<ref name="mitlec" />
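The empirical-risk and margin-loss definitions above can be illustrated numerically. The following Python sketch (not part of the article) uses the hinge loss as a stand-in for a generic margin loss φ; the toy sample and the linear scorer ''f''(''x'') = 2''x'' − 1 are assumptions for the demonstration only.

```python
# Illustrative sketch: empirical risk I_S[f] as the average margin loss over a
# toy sample; the hinge loss phi(v) = max(0, 1 - v) stands in for a generic
# margin loss, and the linear scorer f(x) = 2x - 1 is an assumed example.
def phi(v):
    return max(0.0, 1.0 - v)

def empirical_risk(f, samples):
    # I_S[f] = (1/n) * sum_i phi(y_i * f(x_i))
    return sum(phi(y * f(x)) for x, y in samples) / len(samples)

samples = [(0.0, -1), (0.2, -1), (0.8, 1), (1.0, 1)]  # (x_i, y_i) pairs
f = lambda x: 2.0 * x - 1.0
print(empirical_risk(f, samples))  # average hinge loss, approximately 0.2
```

Minimizing this sample average over a family of candidate functions ''f'' is the empirical-risk-minimization proxy for minimizing the (unknown) expected risk.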
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x}))</math>
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Rai | first= Piyush | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is better to substitute continuous, convex '''loss function surrogates''' which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci">{{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | accessdate= 6 December 2014}}</ref> Some of these surrogates are described below.
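The relationship between the 0–1 loss and its convex surrogates can be checked numerically. This Python sketch (an illustration, not part of the article; the function names are assumed) writes each loss as a function of the margin ''v'' = ''yf''(''x''):

```python
import math

# Illustrative comparison of the 0-1 loss with three convex surrogates,
# each written as a function of the margin v = y*f(x).
def zero_one(v):  return 1.0 if v <= 0.0 else 0.0                      # 0-1 indicator, H(-v)
def hinge(v):     return max(0.0, 1.0 - v)                             # hinge loss
def square(v):    return (1.0 - v) ** 2                                # square loss
def logistic(v):  return math.log(1.0 + math.exp(-v)) / math.log(2.0)  # logistic loss, base 2

# Each surrogate upper-bounds the 0-1 loss, and all take the value 1 at
# v = 0, matching the plot in the lead image.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    for surrogate in (hinge, square, logistic):
        assert surrogate(v) >= zero_one(v)
```

Because each surrogate upper-bounds the 0–1 loss, driving the surrogate risk down also controls the misclassification rate, which is one reason these substitutes are useful in practice.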
==Bayes consistency==
<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name="robust" /><ref>{{Cite journal|last=Leistner|first=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|date=September 2009|title=On robustness of on-line boosting - a competitive study|url=https://ieeexplore.ieee.org/document/5457451|journal=2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops|pages=1362–1369|doi=10.1109/ICCVW.2009.5457451}}</ref> For such loss functions, the posterior probability <math>p(y=1|\vec{x})</math> can be derived using the invertible ''link function'' as <math>p(y=1|\vec{x})=\eta=f^{-1}(v)</math>.
{| class="wikitable"
|+Table-I
|<math>4\eta(1-\eta)</math>
|<math>\arctan(v)+\frac{1}{2}</math>
|}
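The generation formula above can be verified numerically. This Python sketch (an illustrative check, not part of the article) plugs the Tangent-loss row of Table-I, <math>C(\eta)=4\eta(1-\eta)</math> and <math>f^{-1}(v)=\arctan(v)+\tfrac{1}{2}</math>, into the formula and confirms that the generated loss works out algebraically to <math>(2\arctan(v)-1)^2</math>:

```python
import math

# Numerical check of the generation formula
#   phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v)) * C'[f^{-1}(v)]
# for the Tangent-loss row of Table-I:
#   C(eta) = 4*eta*(1 - eta),  f^{-1}(v) = arctan(v) + 1/2.
def C(eta):       return 4.0 * eta * (1.0 - eta)
def C_prime(eta): return 4.0 - 8.0 * eta     # derivative of C
def f_inv(v):     return math.atan(v) + 0.5  # inverse link function

def generated_phi(v):
    eta = f_inv(v)
    return C(eta) + (1.0 - eta) * C_prime(eta)

# The generated loss matches the closed form (2*arctan(v) - 1)^2.
for v in (-1.0, 0.0, 0.3, 1.0):
    assert abs(generated_phi(v) - (2.0 * math.atan(v) - 1.0) ** 2) < 1e-12
```

The same check can be repeated for any other row of Table-I by substituting that row's <math>C(\eta)</math> and <math>f^{-1}(v)</math>.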
== Square loss ==
While more commonly used in regression, the square loss function can be re-written as a function <math>\phi(yf(\vec{x}))</math> and utilized for classification. Defined as