[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
{{Attention|reason=Discuss the difference compared to scoring rules|date=January 2024}}
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 | s2cid = 11845688 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{ -1,1 \}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathcal{Y}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | access-date= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last1= Rosasco | first1= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
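In practice <math>p(\vec{x},y)</math> is unknown, so the risk is approximated by averaging the loss over a finite sample. The following sketch illustrates this empirical average; the particular loss, the scoring function, and the toy data are placeholder assumptions for illustration only, not part of any standard library.

<syntaxhighlight lang="python">
import math

# Placeholder surrogate loss V(f(x), y): here the logistic loss
# log(1 + exp(-y * f(x))) / log(2); any margin-based loss could be used.
def V(score, y):
    return math.log(1.0 + math.exp(-y * score)) / math.log(2.0)

# Illustrative linear scoring function f: R^2 -> R with fixed weights.
def f(x, w=(0.8, -0.3)):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy sample {(x_i, y_i)} with labels in {-1, +1}.
sample = [((1.0, 0.5), 1), ((-0.7, 1.2), -1), ((0.2, -0.4), 1)]

# Empirical risk: the sample average of V(f(x_i), y_i),
# a finite-sample stand-in for the expected loss I[f].
empirical_risk = sum(V(f(x), y) for x, y in sample) / len(sample)
print(empirical_risk)
</syntaxhighlight>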
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
:<math>\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)</math>
where <math>\eta=p(y=1\mid\vec{x})</math>, which is also equivalent to setting the derivative of the conditional risk equal to zero.
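As an illustration of how equation (1) determines the optimal predictor, consider the square loss <math>\phi(v)=(1-v)^2</math> (shown in blue in the figure above). Substituting <math>\frac{\partial \phi(f)}{\partial f}=-2(1-f)</math> and <math>\frac{\partial \phi(-f)}{\partial f}=2(1+f)</math> into (1) gives
:<math>-2(1-f)\eta + 2(1+f)(1-\eta)=0 \;\Longrightarrow\; 2f+2-4\eta=0 \;\Longrightarrow\; f^*=2\eta-1.</math>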
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x}))</math>
where <math>H</math> is the [[Heaviside step function]].
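A minimal code sketch of this 0–1 loss (the function names are illustrative, not from any library):

<syntaxhighlight lang="python">
def heaviside(t):
    # Heaviside step function: 1 for t >= 0 and 0 otherwise
    # (the convention at t = 0 only matters exactly on the decision boundary).
    return 1 if t >= 0 else 0

def zero_one_loss(score, y):
    # V(f(x), y) = H(-y * f(x)): 0 when the sign of the score agrees with
    # the label y in {-1, +1}, and 1 when it does not.
    return heaviside(-y * score)

print(zero_one_loss(2.3, 1))   # correctly classified: 0
print(zero_one_loss(2.3, -1))  # misclassified: 1
</syntaxhighlight>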
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>)]]
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last1=Vasconcelos|first1=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function with a larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin of the logistic loss can be increased by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where smaller <math>0<\gamma<1</math> increases the margin of the loss. This is directly equivalent to decreasing the learning rate in [[gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes clear that when a learning rate <math>\gamma</math> is used, the correct formula for retrieving the posterior probability becomes <math>\eta=f^{-1}(\gamma F(x))</math>.
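A small numeric sketch of this relationship; the code is illustrative only and assumes the standard logit link <math>f^{-1}(v)=1/(1+e^{-v})</math> of the logistic loss, with <math>F(x)</math> standing for the boosted score:

<syntaxhighlight lang="python">
import math

def logistic_loss(v, gamma=1.0):
    # Margin-parametrized logistic loss (1/gamma) * log(1 + exp(-gamma * v));
    # a smaller gamma in (0, 1) increases the loss margin.
    return math.log(1.0 + math.exp(-gamma * v)) / gamma

def posterior_from_score(F, gamma=1.0):
    # Recover eta = p(y = 1 | x) from a score F(x) obtained with learning
    # rate gamma: eta = f^{-1}(gamma * F(x)), assuming the logit link.
    return 1.0 / (1.0 + math.exp(-gamma * F))

# The same raw score corresponds to a less extreme probability estimate
# when a smaller learning rate (larger loss margin) was used.
print(posterior_from_score(2.0, gamma=1.0))  # ~0.88
print(posterior_from_score(2.0, gamma=0.2))  # ~0.60
</syntaxhighlight>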
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}\right) = e^{-v}</math>
The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponential loss is used in the [[AdaBoost]] algorithm.
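A quick numerical comparison (illustrative only) of how much faster the exponential loss grows than the logistic loss on a badly misclassified point:

<syntaxhighlight lang="python">
import math

v = -5.0  # strongly negative margin y * f(x), i.e. a badly misclassified point

exponential_loss = math.exp(-v)               # e^{-v}, about 148.4
logistic_loss = math.log(1.0 + math.exp(-v))  # log(1 + e^{-v}), natural log, about 5.0

# The exponential loss is roughly 30 times larger at this margin, which is
# why it weights outliers much more heavily than the logistic loss does.
print(exponential_loss, logistic_loss)
</syntaxhighlight>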
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
:<math>f^*_\text{Exp}= \frac{1}{2}\ln\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\ln\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>

== Tangent loss ==
The Tangent loss function can be generated using <math>C(\eta)=4\eta(1-\eta)</math> and <math>f^{-1}(v)=\arctan(v)+\frac{1}{2}</math> as
:<math>
\begin{align}
\phi(v) & = C[f^{-1}(v)]+\left( 1-f^{-1}(v)\right) C'[f^{-1}(v)]
\\ & = 4 \left( \arctan(v)+\frac{1}{2} \right) \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) + \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) \left( 4-8 \left( \arctan(v)+\frac{1}{2} \right) \right) \\
& = \left( 2\arctan(v)-1 \right) ^2.
\end{align}
</math>
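A short numerical check (illustrative only) that the simplification above is correct, comparing the unsimplified expression with <math>u=\arctan(v)+\tfrac{1}{2}</math> against the closed form:

<syntaxhighlight lang="python">
import math

def tangent_loss_long_form(v):
    # 4u(1 - u) + (1 - u)(4 - 8u) with u = arctan(v) + 1/2, as derived above.
    u = math.atan(v) + 0.5
    return 4 * u * (1 - u) + (1 - u) * (4 - 8 * u)

def tangent_loss(v):
    # The simplified closed form (2 * arctan(v) - 1)^2.
    return (2 * math.atan(v) - 1) ** 2

# The two expressions agree (up to floating-point rounding) at arbitrary points.
for v in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tangent_loss_long_form(v) - tangent_loss(v)) < 1e-12
</syntaxhighlight>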
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
:<math>f^*_\text{Tangent}= \tan \left( \eta-\frac{1}{2} \right) =\tan \left( p \left( 1\mid x \right) -\frac{1}{2}\right) .</math>
== Hinge loss ==
The hinge loss is defined by <math>\phi(v) = \max(0, 1-v)</math>; it is the loss function used by [[support vector machine]]s.

== References ==
{{Reflist}}
{{Artificial intelligence navbox}}
[[Category:Machine learning algorithms]]