Loss functions for classification

[[File:BayesConsistentLosses2.jpg|thumb|Bayes Consistent Losses: Zero-One Loss (gray), Savage Loss (green), Logistic Loss (orange), Exponential Loss (purple), Tangent Loss (brown), Square Loss (blue)]]
[[File:Loss function surrogates.svg|thumb|Plot of common classification loss functions as a function of the margin ''yf''(''x''). Blue is the 0–1 indicator function. Green is the square loss function. Purple is the hinge loss function. Yellow is the logistic loss function. Note that all surrogates give a loss penalty of 1 at {{math|''yf''(''x''){{=}}0}}.]]
 
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| pmc = | citeseerx = 10.1.1.109.6786 }}</ref> Given <math>X</math> as the space of all possible inputs and <math>Y = \{-1, 1\}</math> as the set of all possible outputs, we wish to find a function <math>f: X \to \mathbb{R}</math> which best maps <math>\vec{x}</math> to <math>y</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | accessdate= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last= Rosasco | first= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize the expected risk, defined as
:<math>I[f] = \displaystyle \int_{X \times Y} V(f(\vec{x}),y) p(\vec{x},y) \, d\vec{x} \, dy</math>
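In practice the integral above is estimated from a finite sample as an average loss over the data. The following minimal Python sketch is not taken from the article; it uses made-up data and a hypothetical scorer <code>f</code> to evaluate several of the surrogate losses shown in the figure as functions of the margin <math>y f(\vec{x})</math> and average them to approximate <math>I[f]</math>:

<syntaxhighlight lang="python">
import math

# Illustrative sketch (not from the article): common surrogate losses written
# as functions of the margin v = y * f(x), with labels y in {-1, +1}.
def zero_one_loss(v):
    return 0.0 if v > 0 else 1.0                     # 0-1 indicator loss

def hinge_loss(v):
    return max(0.0, 1.0 - v)                         # hinge loss

def logistic_loss(v):
    return math.log(1.0 + math.exp(-v)) / math.log(2)  # logistic loss, scaled so loss(0) = 1

def exponential_loss(v):
    return math.exp(-v)                              # exponential loss

def square_loss(v):
    return (1.0 - v) ** 2                            # square loss in margin form

# Empirical risk: average loss over a finite sample, an estimate of I[f].
def empirical_risk(loss, f, samples):
    return sum(loss(y * f(x)) for x, y in samples) / len(samples)

# Toy usage with a hypothetical linear scorer f(x) = 2x - 1 and made-up data.
samples = [(0.9, +1), (0.2, -1), (0.6, +1), (0.4, -1)]
f = lambda x: 2 * x - 1
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                   ("logistic", logistic_loss), ("square", square_loss)]:
    print(name, empirical_risk(loss, f, samples))
</syntaxhighlight>

Each of these surrogates returns a penalty of 1 at margin 0, consistent with the caption of the figure above.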