Content deleted Content added
added Differentiable computing navbox |
Maxeto0910 (talk | contribs) |
||
(35 intermediate revisions by 22 users not shown) | |||
Line 1:
{{Short description|Concept in machine learning}}
{{Machine learning}}
[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
{{Attention|reason=Discuss the difference compared to scoring rules|date=January 2024}}
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 | s2cid = 11845688 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{ -1,1 \}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
Within classification, several commonly used [[loss functions]] are written solely in terms of the product of the true label <math>y</math> and the predicted label <math>f(\vec{x})</math>. Therefore, they can be defined as functions of only one variable <math>\upsilon=y f(\vec{x})</math>, so that <math>V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon)</math> with a suitably chosen function <math>\phi:\mathbb{R}\to\mathbb{R}</math>. These are called '''margin-based loss functions'''. Choosing a margin-based loss function amounts to choosing <math>\phi</math>. Selection of a loss function within this framework impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk, see [[empirical risk minimization]].
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
Line 13 ⟶ 15:
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})]\, p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))]\, p(\vec{x})\,d\vec{x}
\end{align}
</math>
Line 26 ⟶ 28:
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
:<math>\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)</math>{{Citation needed|date=February 2023}}{{Clarify|reason=What is η?|date=February 2023}}▼
:<math>▼
▲\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0 \;\;\;\;\;(1)
which is also equivalent to setting the derivative of the conditional risk equal to zero.▼
\eta=p(y=1|\vec{x})
▲</math>, which is also equivalent to setting the derivative of the conditional risk equal to zero.
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value of 0 if the predicted classification equals that of the true class or a 1 if the predicted classification does not match the true class. This selection is modeled by
Line 36 ⟶ 38:
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url=
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
Line 55 ⟶ 57:
A loss function is said to be ''classification-calibrated or Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math>and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions.
For convex margin loss <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite
{| class="wikitable"
|+Table-I
Line 102 ⟶ 104:
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>)
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last1=Vasconcelos|first1=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math> where smaller <math>0<\gamma<1</math> increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in [[gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes it clear that when a learning rate of <math>\gamma</math> is used, the correct formula for retrieving the posterior probability is now <math>\eta=f^{-1}(\gamma F(x))</math>.
Line 137 ⟶ 139:
This function is undefined when <math>p(1\mid x)=1</math> or <math>p(1\mid x)=0</math> (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when <math>p(1\mid x)</math> increases and equals 0 when <math>p(1\mid x)= 0.5</math>.<ref name="mitlec" />
It's easy to check that the logistic loss and binary [[cross
== Exponential loss ==
The exponential loss function can be generated using (2) and Table-I as follows
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}\right) = e^{-v}</math>
The exponential loss is convex and grows exponentially for negative values which makes it more sensitive to outliers. The
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
Line 153 ⟶ 155:
The Savage loss<ref name=":0" /> can be generated using (2) and Table-I as follows
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = \left(\frac{e^v}{1+e^v}\right)\left(1-\frac{e^v}{1+e^v}\right)+\left(1-\frac{e^v}{1+e^v}\right)\left(1-\frac{2e^v}{1+e^v}\right) = \frac{1}{(1+e^v)^2}.</math>
The Savage loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. The Savage loss has been used in [[gradient boosting]] and the SavageBoost algorithm.
Line 162 ⟶ 164:
== Tangent loss ==
The Tangent loss<ref>{{Cite
:<math>
\begin{align}
\phi(v) & = C[f^{-1}(v)]+\left( 1-f^{-1}(v)\right) C'[f^{-1}(v)]
\\ & = 4 \left( \arctan(v)+\frac{1}{2} \right) \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) + \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) \left( 4-8 \left( \arctan(v)+\frac{1}{2} \right) \right) \\
& = \left( 2\arctan(v)-1 \right) ^2.
\end{align}
</math>
The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in [[gradient boosting]], the TangentBoost algorithm and Alternating Decision Forests.<ref>{{Cite
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
:<math>f^*_\text{Tangent}= \tan \left( \eta-\frac{1}{2} \right) =\tan \left( p \left( 1\mid x \right) -\frac{1}{2}\right) .</math>
== Hinge loss ==
Line 205 ⟶ 208:
== See also ==
*[[Differentiable programming]]
*[[Scoring function]]
== References ==
{{Reflist}}
{{Artificial intelligence navbox}}
[[Category:Machine learning algorithms]]
|