as a proxy for expected risk.<ref name="mitlec" /> (See [[statistical learning theory]] for a more detailed description.)
For computational ease, it is standard practice to write [[loss functions]] as functions of only one variable <math>\upsilon=y f(\vec{x})</math>. Within classification, loss functions are generally written solely in terms of the product of the true label <math>y</math> and the predicted value <math>f(\vec{x})</math>, so that the loss can be expressed as
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))=\phi(\upsilon)</math>.
Selection of a loss function within this framework impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk. Loss functions in this form are known as ''margin losses''.
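As a minimal illustrative sketch (the hinge, logistic, and exponential losses used here are standard surrogate losses, described later in this article), the following shows margin losses being evaluated as functions of the single variable <math>\upsilon = yf(\vec{x})</math>:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative margin losses written as functions of v = y * f(x) only.
def hinge(v):        return np.maximum(0.0, 1.0 - v)   # hinge loss
def logistic(v):     return np.log(1.0 + np.exp(-v))   # logistic loss
def exponential(v):  return np.exp(-v)                  # exponential loss

y, fx = 1, -0.4            # hypothetical true label and predicted value
v = y * fx                 # the margin
# Each loss depends on (f(x), y) only through their product v.
print(hinge(v), logistic(v), exponential(v))
</syntaxhighlight>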
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is better to substitute continuous, convex '''loss function surrogates''' which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci"> {{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | accessdate= 6 December 2014}}</ref> Some of these surrogates are described below.
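As an illustrative sketch (the synthetic data, linear model, step size, and iteration count below are arbitrary choices, not taken from the sources above), the empirical risk of a linear classifier can be minimized under the convex logistic surrogate by plain gradient descent, whereas directly minimizing the original non-convex loss would be intractable:

<syntaxhighlight lang="python">
import numpy as np

# Synthetic linearly separable data with labels in {-1, +1} (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Gradient descent on the empirical risk under the logistic surrogate
# phi(v) = log(1 + exp(-v)) for a linear classifier f(x) = w . x.
w = np.zeros(2)
for _ in range(500):
    v = y * (X @ w)                                        # margins y * f(x)
    grad = -(X * (y / (1.0 + np.exp(v)))[:, None]).mean(axis=0)
    w -= 0.5 * grad                                        # fixed step size

# Accuracy of the resulting classifier under the original 0-1 loss.
print(np.mean(np.sign(X @ w) == y))
</syntaxhighlight>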
==Bayes consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, i.e. the one that minimizes the expected risk associated with the 0–1 loss, implements the Bayes optimal decision rule for a binary classification problem and takes the form
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
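A minimal sketch of this rule, assuming the posterior <math>p(1\mid\vec{x})</math> is already known (the values below are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

p1 = np.array([0.9, 0.5, 0.2])    # hypothetical values of p(1 | x) for three inputs
f_star = np.sign(p1 - (1 - p1))   # +1, 0, or -1, matching the three cases above
print(f_star)                     # [ 1.  0. -1.]
</syntaxhighlight>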
A loss function <math>\phi(yf(\vec{x}))</math> is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{\phi}(\vec{x}) = \operatorname{sgn}(f^*_{0/1}(\vec{x}))</math> and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions. For convex <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)<0</math>.<ref>{{Cite journal|last=Bartlett|first=Peter L.|last2=Jordan|first2=Michael I.|last3=Mcauliffe|first3=Jon D.|date=2006|title=Convexity, Classification, and Risk Bounds|url=https://www.jstor.org/stable/30047445|journal=Journal of the American Statistical Association|volume=101|issue=473|pages=138–156|issn=0162-1459}}</ref><ref name="mit" /> However, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation<ref name="robust">{{Citation|last=Masnadi-Shirazi|first=Hamed|title=On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost|url=http://www.svcl.ucsd.edu/publications/conference/2008/nips08/NIPS08LossesWITHTITLE.pdf|publisher=Statistical Visual Computing Laboratory, University of California, San Diego|accessdate=6 December 2014|last2=Vasconcelos|first2=Nuno}}</ref>
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]</math>,
where <math>f(\eta)</math>, <math>0\leq \eta \leq 1</math>, is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>.
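For example, choosing the logit link <math>f(\eta)=\log\left(\tfrac{\eta}{1-\eta}\right)</math> together with the binary entropy <math>C(\eta)=-\eta\log\eta-(1-\eta)\log(1-\eta)</math>, both of which satisfy the conditions above, recovers the logistic loss <math>\phi(v)=\log(1+e^{-v})</math>. The following illustrative sketch checks this numerically:

<syntaxhighlight lang="python">
import numpy as np

def f_inv(v):          # inverse of the logit link f(eta) = log(eta / (1 - eta))
    return 1.0 / (1.0 + np.exp(-v))

def C(eta):            # binary entropy: differentiable, strictly concave, C(eta) = C(1 - eta)
    return -eta * np.log(eta) - (1 - eta) * np.log(1 - eta)

def C_prime(eta):      # derivative of C
    return np.log((1 - eta) / eta)

v = np.linspace(-3.0, 3.0, 7)
eta = f_inv(v)
phi = C(eta) + (1 - eta) * C_prime(eta)            # the generation formula above
print(np.allclose(phi, np.log(1 + np.exp(-v))))    # True: the logistic loss is recovered
</syntaxhighlight>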
==Simplifying expected risk for classification==