Loss functions for classification: Difference between revisions

Added classification calibrated and Bayes consistency.
added how to generate Bayes consistent losses
Line 16:
as a proxy for expected risk.<ref name="mitlec" /> (See [[statistical learning theory]] for a more detailed description.)
 
For computational ease, it is standard practice to write [[loss functions]] as functions of only one variable <math>\upsilon=y f(\vec{x})</math>. Within classification, loss functions are generally written solely in terms of the product of the true label <math>y</math> and the predicted value <math>f(\vec{x})</math>.<ref name="robust"> {{Citation | last= Masnadi-Shirazi | first= Hamed | last2= Vasconcelos | first2= Nuno | title= On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost | publisher= Statistical Visual Computing Laboratory, University of California, San Diego | url= http://www.svcl.ucsd.edu/publications/conference/2008/nips08/NIPS08LossesWITHTITLE.pdf | accessdate= 6 December 2014}}</ref> Selection of a loss function within this framework
 
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))=\phi(\upsilon)</math>
 
impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk. Loss functions in this form are known as ''margin losses''.
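For illustration, two commonly used surrogate losses, the hinge loss and the logistic loss (written here without normalization constants, an editorial simplification), are margin losses because they depend on <math>\vec{x}</math> and <math>y</math> only through <math>\upsilon = yf(\vec{x})</math>:

:<math>\phi_{\text{hinge}}(\upsilon)=\max(0,\,1-\upsilon), \qquad \phi_{\text{log}}(\upsilon)=\ln\left(1+e^{-\upsilon}\right).</math>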
 
Line 25 ⟶ 27:
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is better to substitute continuous, convex '''loss function surrogates''' which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci"> {{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | accessdate= 6 December 2014}}</ref> Some of these surrogates are described below.
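To illustrate why convexity makes the problem tractable, the following minimal sketch (not taken from the cited lectures; the toy data, the choice of the logistic surrogate, and the plain gradient-descent routine are assumptions made only for this example) fits a linear classifier by gradient descent on a convex margin loss:

<syntaxhighlight lang="python">
import numpy as np

def logistic_surrogate(margins):
    # Convex margin loss phi(v) = ln(1 + e^(-v)), evaluated at v_i = y_i f(x_i).
    return np.log1p(np.exp(-margins))

def fit_linear_classifier(X, y, lr=0.1, steps=2000):
    # Minimize the empirical surrogate risk (1/n) * sum_i phi(y_i * w.x_i) by gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # phi'(v) = -1 / (1 + e^v), so the gradient is -(1/n) * sum_i y_i x_i / (1 + e^(v_i)).
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# Toy, linearly separable data with labels in {-1, +1} (chosen only for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = fit_linear_classifier(X, y)
print(np.sign(X @ w))                          # predicted labels, here [ 1.  1. -1. -1.]
print(logistic_surrogate(y * (X @ w)).mean())  # empirical surrogate risk after training
</syntaxhighlight>

Because the surrogate is convex, every local minimum of the empirical surrogate risk is global, so simple gradient methods suffice, whereas directly minimizing the zero–one loss is a combinatorial problem.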
 
==Bayes consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, i.e., the one which minimizes the expected risk associated with the zero–one loss, implements the Bayes optimal decision rule for a binary classification problem and takes the form
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
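For instance, if at some <math>\vec{x}</math> the posterior probabilities were <math>p(1\mid\vec{x})=0.7</math> and <math>p(-1\mid\vec{x})=0.3</math> (values chosen purely for illustration), then <math>f^*_{0/1}(\vec{x})=1</math>, and the conditional expected zero–one loss at that point is <math>0.3</math>, the smallest value achievable by any decision rule.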
 
A loss function <math>\phi(yf(\vec{x}))</math> is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math> and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions. For convex <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)<0</math>.<ref name="bartlett">{{Cite journal|last=Bartlett|first=Peter L.|last2=Jordan|first2=Michael I.|last3=Mcauliffe|first3=Jon D.|date=2006|title=Convexity, Classification, and Risk Bounds|url=https://www.jstor.org/stable/30047445|journal=Journal of the American Statistical Association|volume=101|issue=473|pages=138–156|issn=0162-1459}}</ref><ref name="mit" /> Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation:<ref name="robust" />
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]</math>,
 
where <math>f(\eta)</math>, <math>0\leq \eta \leq 1</math>, is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>.
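As an illustration of this construction (the particular pair below is a standard choice associated with the exponential loss; the algebra is worked out here for the example rather than quoted from the article), take <math>C(\eta)=2\sqrt{\eta(1-\eta)}</math>, which is strictly concave and satisfies <math>C(\eta)=C(1-\eta)</math>, and <math>f(\eta)=\tfrac{1}{2}\ln\frac{\eta}{1-\eta}</math>, so that <math>f^{-1}(v)=\frac{1}{1+e^{-2v}}</math> and <math>f^{-1}(-v)=1-f^{-1}(v)</math>. Substituting into the formula above gives

:<math>\phi(v)=\frac{2e^{-v}}{1+e^{-2v}}+\frac{e^{-v}\left(e^{-2v}-1\right)}{1+e^{-2v}}=e^{-v},</math>

the exponential loss, which is therefore Bayes consistent; it also satisfies the convex criterion above, being differentiable at 0 with <math>\phi'(0)=-1<0</math>.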
Furthermore, it can be shown that for any convex loss function <math>V(yf_0(\vec{x}))</math>, where <math>f_0</math> is the function that minimizes the expected risk associated with this loss, if <math>f_0(\vec{x}) \ne 0</math> and <math>V</math> is decreasing in a neighborhood of 0, then <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f_0(\vec{x}))</math>, where <math>\operatorname{sgn}</math> is the [[sign function]] (for a proof see <ref name="bartlett" />). Note also that <math>f_0(\vec{x}) \ne 0</math> holds in practice when the loss function is differentiable at the origin. This fact confers a consistency property upon all such convex loss functions; specifically, minimizing any of them leads to results consistent with minimizing the 0–1 loss given infinite data. Consequently, the expected risk under the 0–1 loss can be bounded in terms of the expected risk under any of these convex surrogate losses.<ref name="mit" />
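For example (restating for illustration a comparison established by Bartlett, Jordan and McAuliffe cited above; the risk notation <math>R</math> is introduced here only for this example), in the case of the hinge loss the excess risks compare directly:

:<math>R_{0/1}(f)-R^*_{0/1}\;\le\;R_{\text{hinge}}(f)-R^*_{\text{hinge}},</math>

where <math>R_{0/1}(f)</math> and <math>R_{\text{hinge}}(f)</math> denote the expected risks of <math>f</math> under the zero–one and hinge losses, and the starred quantities are the corresponding minimal risks.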
 
==Simplifying expected risk for classification==