For computational ease, it is standard practice to write [[loss functions]] as functions of only one variable. Within classification, loss functions are generally written solely in terms of the product of the true label <math>y</math> and the predicted value <math>f(\vec{x})</math>.<ref name="robust">{{Citation | last= Masnadi-Shirazi | first= Hamed | last2= Vasconcelos | first2= Nuno | title= On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost | publisher= Statistical Visual Computing Laboratory, University of California, San Diego | url= http://www.svcl.ucsd.edu/publications/conference/2008/nips08/NIPS08LossesWITHTITLE.pdf | accessdate= 6 December 2014}}</ref> Selection of a loss function within this framework
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))</math>
impacts the optimal <math>f^{*}_{\phi}</math> which minimizes the expected risk.
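To illustrate the single-variable form above, the following minimal sketch (not taken from the cited sources; the function names, scalings and numerical values are illustrative choices) writes a few well-known classification losses as a function <math>\phi(v)</math> of the margin <math>v = yf(\vec{x})</math>.
<syntaxhighlight lang="python">
import numpy as np

# Common classification losses expressed as a function phi of the single
# variable v = y * f(x), the "margin"; scalings are illustrative choices.
def hinge(v):        # hinge loss (used by support vector machines)
    return np.maximum(0.0, 1.0 - v)

def logistic(v):     # logistic loss (here without the 1/ln 2 normalisation)
    return np.log1p(np.exp(-v))

def exponential(v):  # exponential loss (used by AdaBoost)
    return np.exp(-v)

y, fx = 1, 0.4       # example true label and predicted value
v = y * fx           # all three losses depend on y and f(x) only through this product
print(hinge(v), logistic(v), exponential(v))
</syntaxhighlight>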
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and the value 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x}))</math>
where <math>H</math> is the [[Heaviside step function]].
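As a concrete sketch (illustrative, not from the cited sources; counting the boundary case <math>yf(\vec{x})=0</math> as an error is an assumed convention), the 0–1 loss simply indicates whether the sign of the prediction disagrees with the true label.
<syntaxhighlight lang="python">
def zero_one_loss(y, fx):
    """0-1 loss: 0 if sign(fx) matches the label y in {-1, +1}, else 1.

    The boundary case y * fx == 0 is counted as an error here; this is a
    convention choice, not something fixed by the definition above.
    """
    return 0 if y * fx > 0 else 1

print(zero_one_loss(+1, 0.7))   # 0: correctly classified
print(zero_one_loss(-1, 0.7))   # 1: misclassified
</syntaxhighlight>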
==Bounds for classification==
Using [[Bayes' theorem]], it can be shown that for a binary classification problem the optimal <math>f^*</math>, i.e. the function implementing the Bayes optimal decision rule, is
:<math>f^*(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
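A minimal sketch of this decision rule, assuming the posterior probability <math>p(1\mid\vec{x})</math> is available as an input (the function and variable names are illustrative):
<syntaxhighlight lang="python">
def bayes_rule(p_pos):
    """Bayes optimal decision given p_pos = p(1|x); p(-1|x) = 1 - p_pos."""
    p_neg = 1.0 - p_pos
    if p_pos > p_neg:
        return 1
    if p_pos < p_neg:
        return -1
    return 0            # exact tie between the two posterior probabilities

print(bayes_rule(0.8))  # 1
print(bayes_rule(0.3))  # -1
print(bayes_rule(0.5))  # 0
</syntaxhighlight>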
A loss function <math>\phi(yf(\vec{x}))</math> is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{\phi}(\vec{x}) = \operatorname{sgn}(f^*(\vec{x}))</math>, and it is thus equivalent to the Bayes optimal decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function by directly minimizing the expected risk, without having to explicitly model the probability density functions. Furthermore, it can be shown that for any convex loss function <math>V(yf(\vec{x}))</math>, if <math>f_0</math> is the function that minimizes the corresponding expected risk, <math>f_0(\vec{x}) \ne 0</math>, and <math>V</math> is decreasing in a neighborhood of 0, then <math>f^*(\vec{x}) = \operatorname{sgn}(f_0(\vec{x}))</math>
where <math>\operatorname{sgn}</math> is the [[sign function]] (for proof see <ref>{{Cite
This fact confers a consistency property upon such convex loss functions: minimizing any of them leads, in the limit of infinite data, to the same classifications as minimizing the 0–1 loss. Consequently, the excess expected risk under the 0–1 loss can be bounded in terms of the excess expected risk under any of these convex loss functions.<ref name="mit" />
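As a small numerical sketch of this consistency property (using the logistic loss as one example of a convex loss that is decreasing near 0; the grid search and the posterior probabilities are illustrative choices, not part of the cited sources), the minimizer of the conditional expected loss has the same sign as the Bayes optimal rule:
<syntaxhighlight lang="python">
import numpy as np

def conditional_logistic_risk(f, p_pos):
    # E[ log(1 + exp(-y*f)) | x ] when p(1|x) = p_pos and p(-1|x) = 1 - p_pos
    return p_pos * np.log1p(np.exp(-f)) + (1.0 - p_pos) * np.log1p(np.exp(f))

grid = np.linspace(-5.0, 5.0, 100001)      # crude one-dimensional search over f(x)
for p_pos in (0.2, 0.5, 0.9):
    f0 = grid[np.argmin(conditional_logistic_risk(grid, p_pos))]
    # The exact minimizer is the log-odds ln(p_pos / (1 - p_pos)), so sgn(f0)
    # reproduces the Bayes optimal decision for each posterior probability.
    print(p_pos, round(float(f0), 3), int(np.sign(f0)))
</syntaxhighlight>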