Content deleted Content added
No edit summary |
Maxeto0910 (talk | contribs) |
||
(40 intermediate revisions by 25 users not shown) | |||
Line 1:
{{Short description|Concept in machine learning}}
{{Machine learning}}
[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
{{Attention|reason=Discuss the difference compared to scoring rules|date=January 2024}}
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
Within classification, several commonly used [[loss functions]] are written solely in terms of the product of the true label <math>y</math> and the predicted label <math>f(\vec{x})</math>. Therefore, they can be defined as functions of only one variable <math>\upsilon=y f(\vec{x})</math>, so that <math>V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon)</math> with a suitably chosen function <math>\phi:\mathbb{R}\to\mathbb{R}</math>. These are called '''margin-based loss functions'''. Choosing a margin-based loss function amounts to choosing <math>\phi</math>. Selection of a loss function within this framework impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk, see [[empirical risk minimization]].
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
Line 13 ⟶ 15:
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})]\, p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))]\, p(\vec{x})\,d\vec{x}
\end{align}
</math>
Line 26 ⟶ 28:
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
:<math>\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)</math>{{Citation needed|date=February 2023}}{{Clarify|reason=What is η?|date=February 2023}}▼
:<math>▼
▲\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0 \;\;\;\;\;(1)
which is also equivalent to setting the derivative of the conditional risk equal to zero.▼
\eta=p(y=1|\vec{x})
▲</math>, which is also equivalent to setting the derivative of the conditional risk equal to zero.
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value of 0 if the predicted classification equals that of the true class or a 1 if the predicted classification does not match the true class. This selection is modeled by
Line 36 ⟶ 38:
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url=
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
Line 55 ⟶ 57:
A loss function is said to be ''classification-calibrated or Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math>and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions.
For convex margin loss <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite
{| class="wikitable"
|+Table-I
Line 102 ⟶ 104:
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>)
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|
In conclusion, by choosing a loss function with larger margin (smaller <math>\gamma</math>) we increase regularization and improve our estimates of the posterior probability which in turn improves the ROC curve of the final classifier.
Line 113 ⟶ 115:
The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regards to sample complexity) than for the logistic loss or hinge loss functions.<ref name="mit" /> In addition, functions which yield high values of <math>f(\vec{x})</math> for some <math>x \in X</math> will perform poorly with the square loss function, since high values of <math>yf(\vec{x})</math> will be penalized severely, regardless of whether the signs of <math>y</math> and <math>f(\vec{x})</math> match.
A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for [[Tikhonov regularization]], one can solve for the regularization parameter using leave-one-out [[cross-validation (statistics)|cross-validation]] in the same time as it would take to solve a single problem.<ref>{{Citation|
The minimizer of <math>I[f]</math> for the square loss function can be directly found from equation (1) as
Line 137 ⟶ 139:
This function is undefined when <math>p(1\mid x)=1</math> or <math>p(1\mid x)=0</math> (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when <math>p(1\mid x)</math> increases and equals 0 when <math>p(1\mid x)= 0.5</math>.<ref name="mitlec" />
It's easy to check that the logistic loss and binary [[cross
== Exponential loss ==
The exponential loss function can be generated using (2) and Table-I as follows
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}\right) = e^{-v}</math>
The exponential loss is convex and grows exponentially for negative values which makes it more sensitive to outliers. The
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
Line 153 ⟶ 155:
The Savage loss<ref name=":0" /> can be generated using (2) and Table-I as follows
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = \left(\frac{e^v}{1+e^v}\right)\left(1-\frac{e^v}{1+e^v}\right)+\left(1-\frac{e^v}{1+e^v}\right)\left(1-\frac{2e^v}{1+e^v}\right) = \frac{1}{(1+e^v)^2}.</math>
The Savage loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. The Savage loss has been used in [[gradient boosting]] and the SavageBoost algorithm.
Line 162 ⟶ 164:
== Tangent loss ==
The Tangent loss<ref>{{Cite
:<math>
\begin{align}
\phi(v) & = C[f^{-1}(v)]+\left( 1-f^{-1}(v)\right) C'[f^{-1}(v)]
\\ & = 4 \left( \arctan(v)+\frac{1}{2} \right) \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) + \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) \left( 4-8 \left( \arctan(v)+\frac{1}{2} \right) \right) \\
& = \left( 2\arctan(v)-1 \right) ^2.
\end{align}
</math>
The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in [[gradient boosting]], the TangentBoost algorithm and Alternating Decision Forests.<ref>{{Cite
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
:<math>f^*_\text{Tangent}= \tan \left( \eta-\frac{1}{2} \right) =\tan \left( p \left( 1\mid x \right) -\frac{1}{2}\right) .</math>
== Hinge loss ==
Line 202 ⟶ 205:
It is monotonically increasing and reaches 0 when <math>z = 1</math>.
== See also ==
*[[Differentiable programming]]
*[[Scoring function]]
== References ==
{{Reflist}}
{{Artificial intelligence navbox}}
[[Category:Machine learning algorithms]]
|