Loss functions for classification

[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
 
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| pmc = | citeseerx = 10.1.1.109.6786 }}</ref> Given <math>X</math> as the space of all possible inputs, and ''Y''&nbsp;=&nbsp;{–1,1} as the set of all possible outputs, we wish to find a function <math>f: X \to \mathbb{R}</math> which best maps <math>\vec{x}</math> to <math>y</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | accessdate= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last= Rosasco | first= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected risk, defined as
impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk. Loss functions in this form are known as ''margin losses''.
 
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
 
:<math>
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
 
:<math>
\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0 \;\;\;\;\;(1)
</math>
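As an illustrative check (a sketch, not part of the cited derivations), equation (1) can be verified numerically. The Python snippet below, with helper names that are purely illustrative, minimizes the conditional risk <math>\eta\phi(f)+(1-\eta)\phi(-f)</math> for the logistic loss <math>\phi(v)=\log(1+e^{-v})</math> and compares the result against the known minimizer, the log-odds <math>\log\frac{\eta}{1-\eta}</math>:

```python
import math

def conditional_risk(f, eta, phi):
    # Conditional expected risk at a point where P(y = 1 | x) = eta.
    return eta * phi(f) + (1 - eta) * phi(-f)

def minimize_scalar(g, lo=-20.0, hi=20.0, iters=200):
    # Golden-section search; g must be unimodal on [lo, hi],
    # which holds here since the logistic conditional risk is convex in f.
    gr = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - gr * (b - a), a + gr * (b - a)
        if g(c) < g(d):
            b = d
        else:
            a = c
    return (a + b) / 2

logistic = lambda v: math.log(1 + math.exp(-v))

eta = 0.8
f_star = minimize_scalar(lambda f: conditional_risk(f, eta, logistic))
# The numeric minimizer agrees with the log-odds log(eta / (1 - eta)).
print(f_star, math.log(eta / (1 - eta)))
```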
as a proxy for expected risk.<ref name="mitlec" /> (See [[statistical learning theory]] for a more detailed description.)
 
==Bayes consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, i.e., the one that minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is in the form of
 
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
A loss function is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math> and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions.
 
For convex margin loss <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)<0</math>.<ref>{{Cite journal|last=Bartlett|first=Peter L.|last2=Jordan|first2=Michael I.|last3=Mcauliffe|first3=Jon D.|date=2006|title=Convexity, Classification, and Risk Bounds|journal=Journal of the American Statistical Association|volume=101|issue=473|pages=138–156|issn=0162-1459|jstor=30047445|doi=10.1198/016214505000000907}}</ref><ref name="mit" /> Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation<ref name=":0">{{Cite journal|last=Masnadi-Shirazi|first=Hamed|last2=Vasconcelos|first2=Nuno|date=2008|title=On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost|url=http://dl.acm.org/citation.cfm?id=2981780.2981911|journal=Proceedings of the 21st International Conference on Neural Information Processing Systems|series=NIPS'08|___location=USA|publisher=Curran Associates Inc.|pages=1049–1056|isbn=9781605609492}}</ref>
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
 
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite journal|last=Leistner|first=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|date=2009-9|title=On robustness of on-line boosting - a competitive study|url=https://ieeexplore.ieee.org/document/5457451|journal=2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops|pages=1362–1369|doi=10.1109/ICCVW.2009.5457451|isbn=978-1-4244-4442-7}}</ref> For all loss functions generated from (2), the posterior probability <math>p(y=1|\vec{x})</math> can be found using the invertible ''link function'' as <math>p(y=1|\vec{x})=\eta=f^{-1}(v)</math>. Such loss functions, where the posterior probability can be recovered using the invertible link, are called ''proper loss functions''.
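For concreteness, formula (2) can be exercised numerically. The following Python sketch (function names are illustrative, not from the cited sources) plugs the Savage-loss choices <math>C(\eta)=\eta(1-\eta)</math> and <math>f^{-1}(v)=\frac{e^v}{1+e^v}</math> into (2) and checks that the result matches the closed-form Savage loss <math>\frac{1}{(1+e^v)^2}</math>:

```python
import math

def phi_from(C, Cprime, f_inv, v):
    # Formula (2): phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v)) * C'[f^{-1}(v)]
    eta = f_inv(v)
    return C(eta) + (1 - eta) * Cprime(eta)

# Savage choices: C(eta) = eta (1 - eta), so C'(eta) = 1 - 2 eta,
# and link inverse f^{-1}(v) = e^v / (1 + e^v) (written in a stable form).
C = lambda eta: eta * (1 - eta)
Cp = lambda eta: 1 - 2 * eta
f_inv = lambda v: 1 / (1 + math.exp(-v))

for v in [-3.0, -0.5, 0.0, 1.7]:
    closed_form = 1 / (1 + math.exp(v)) ** 2  # Savage loss
    assert abs(phi_from(C, Cp, f_inv, v) - closed_form) < 1e-12
```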
{| class="wikitable"
|+Table-I
!Loss name
!<math>\phi(v)</math>
!<math>C(\eta)</math>
|}The sole minimizer of the expected risk, <math>f^*_{\phi}</math>, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding <math>f(\eta)</math>. This holds even for the nonconvex loss functions, which means that gradient descent based algorithms such as [[gradient boosting]] can be used to construct the minimizer.
 
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>).]]
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last=Vasconcelos|first=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where smaller <math>0<\gamma<1</math> increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in [[gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes clear that when a learning rate of <math>\gamma</math> is used, the correct formula for retrieving the posterior probability is now <math>\eta=f^{-1}(\gamma F(x))</math>.
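The margin formula can be sanity-checked numerically. For the <math>\gamma</math>-parameterized logistic loss above, <math>\phi'(0)=-\frac{1}{2}</math> and <math>\phi''(0)=\frac{\gamma}{4}</math>, giving <math>\mu_{\phi}=\frac{2}{\gamma}</math> (consistent with the figure caption, where <math>\gamma=1</math> gives <math>\mu=2</math>). A minimal Python sketch, with illustrative helper names, estimates the derivatives by finite differences:

```python
import math

def gamma_logistic(v, gamma):
    # Logistic loss with margin parameter gamma: (1/gamma) log(1 + e^{-gamma v})
    return (1 / gamma) * math.log(1 + math.exp(-gamma * v))

def loss_margin(phi, h=1e-4):
    # mu = -phi'(0) / phi''(0), via central finite differences at 0.
    d1 = (phi(h) - phi(-h)) / (2 * h)
    d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2
    return -d1 / d2

# For this family the margin is 2 / gamma, so smaller gamma means larger margin:
for gamma in [1.0, 0.5, 0.2]:
    assert abs(loss_margin(lambda v: gamma_logistic(v, gamma)) - 2 / gamma) < 1e-2
```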
 
In conclusion, by choosing a loss function with larger margin (smaller <math>\gamma</math>) we increase regularization and improve our estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
This function is undefined when <math>p(1\mid x)=1</math> or <math>p(1\mid x)=0</math> (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when <math>p(1\mid x)</math> increases and equals 0 when <math>p(1\mid x)= 0.5</math>.<ref name="mitlec" />
 
It is easy to check that the logistic loss and binary [[cross entropy]] loss (log loss) are in fact the same (up to a multiplicative constant <math>\frac{1}{\log(2)}</math>). The cross entropy loss is closely related to the [[Kullback–Leibler divergence]] between the empirical distribution and the predicted distribution. The cross entropy loss is ubiquitous in modern [[deep learning|deep neural networks]].
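This constant-factor equivalence can be confirmed with a short numerical check (a sketch; helper names are not from the sources), using the sigmoid <math>p=\frac{1}{1+e^{-v}}</math> as the predicted probability:

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def logistic_loss(y, v):
    # Margin form of the logistic loss: log2(1 + e^{-y v})
    return math.log2(1 + math.exp(-y * v))

def cross_entropy(y, v):
    # Binary cross entropy in nats; labels y in {-1, 1} are mapped to {0, 1}.
    p = sigmoid(v)
    t = (y + 1) / 2
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

# The two losses agree up to the multiplicative constant 1 / log(2):
for y in (-1, 1):
    for v in (-2.0, 0.3, 5.0):
        assert abs(logistic_loss(y, v) - cross_entropy(y, v) / math.log(2)) < 1e-12
```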
 
== Exponential loss ==
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = (\frac{e^v}{1+e^v})(1-\frac{e^v}{1+e^v})+(1-\frac{e^v}{1+e^v})(1-\frac{2e^v}{1+e^v}) = \frac{1}{(1+e^v)^2}.</math>
 
The Savage loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Savage loss has been used in [[gradient boosting]] and the SavageBoost algorithm.
 
The minimizer of <math>I[f]</math> for the Savage loss function can be directly found from equation (1) as
The Tangent loss<ref>{{Cite journal|last=Masnadi-Shirazi|first=H.|last2=Mahadevan|first2=V.|last3=Vasconcelos|first3=N.|date=2010-6|title=On the design of robust classifiers for computer vision|url=https://ieeexplore.ieee.org/document/5540136|journal=2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition|pages=779–786|doi=10.1109/CVPR.2010.5540136|isbn=978-1-4244-6984-0}}</ref> can be generated using (2) and Table-I as follows
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 4(\arctan(v)+\frac{1}{2})(1-(\arctan(v)+\frac{1}{2}))+(1-(\arctan(v)+\frac{1}{2}))(4-8(\arctan(v)+\frac{1}{2})) = (2\arctan(v)-1)^2.</math>
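As with the Savage loss, the Tangent-loss instance of formula (2) can be checked numerically (an illustrative sketch; function names are not from the sources), using the Table-I choices <math>C(\eta)=4\eta(1-\eta)</math> and <math>f^{-1}(v)=\arctan(v)+\frac{1}{2}</math>:

```python
import math

def phi_tangent_from_table(v):
    # Formula (2) with C(eta) = 4 eta (1 - eta), so C'(eta) = 4 - 8 eta,
    # and link inverse f^{-1}(v) = arctan(v) + 1/2.
    eta = math.atan(v) + 0.5
    return 4 * eta * (1 - eta) + (1 - eta) * (4 - 8 * eta)

# The expression collapses to the closed form (2 arctan(v) - 1)^2:
for v in [-0.4, 0.0, 0.3]:
    closed_form = (2 * math.atan(v) - 1) ** 2
    assert abs(phi_tangent_from_table(v) - closed_form) < 1e-12
```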
 
The Tangent loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in [[gradient boosting]], the TangentBoost algorithm and Alternating Decision Forests.<ref>{{Cite journal|last=Schulter|first=S.|last2=Wohlhart|first2=P.|last3=Leistner|first3=C.|last4=Saffari|first4=A.|last5=Roth|first5=P. M.|last6=Bischof|first6=H.|date=2013-6|title=Alternating Decision Forests|url=https://ieeexplore.ieee.org/document/6618916|journal=2013 IEEE Conference on Computer Vision and Pattern Recognition|pages=508–515|doi=10.1109/CVPR.2013.72|isbn=978-0-7695-4989-7}}</ref>
 
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
when <math>p(1\mid x) \ne 0.5</math>, which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between expected risk and the sign of hinge loss function.<ref name="mit" /> The Hinge loss cannot be derived from (2) since <math>f^*_{\text{Hinge}}</math> is not invertible.
 
== Generalized smooth hinge loss ==
The generalized smooth hinge loss function with parameter <math>\alpha</math> is defined as
 
:<math>f^*_\alpha(z) \;=\; \begin{cases} \frac{\alpha}{\alpha + 1} - z & \text{if }z< 0 \\ \frac{1}{\alpha + 1}z^{\alpha + 1} - z + \frac{\alpha}{\alpha + 1} & \text{if } 0<z<1 \\ 0 & \text{if } z \geq 1 \end{cases},</math>
 
where
:<math>z = yf(\vec{x}).</math>
 
It is monotonically decreasing and reaches 0 when <math>z = 1</math>.
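A small Python sketch (illustrative, assuming <math>\alpha>0</math> and the standard smoothed-hinge form with a linear branch for negative margins) implements the piecewise definition and checks continuity at the breakpoints as well as the non-increasing behaviour in the margin:

```python
def smooth_hinge(z, alpha):
    # Generalized smooth hinge loss with parameter alpha (assumes alpha > 0).
    if z < 0:
        return alpha / (alpha + 1) - z
    if z < 1:
        return z ** (alpha + 1) / (alpha + 1) - z + alpha / (alpha + 1)
    return 0.0

for alpha in (1.0, 2.0, 4.0):
    # Continuous at the breakpoints z = 0 and z = 1 ...
    assert abs(smooth_hinge(-1e-9, alpha) - smooth_hinge(1e-9, alpha)) < 1e-6
    assert abs(smooth_hinge(1 - 1e-9, alpha) - 0.0) < 1e-6
    # ... and non-increasing in the margin z, reaching 0 at z = 1.
    zs = [i / 10 - 1 for i in range(25)]
    vals = [smooth_hinge(z, alpha) for z in zs]
    assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))
```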
 
== References ==