Loss functions for classification
[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
 
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{-1,1\}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathbb{R}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | accessdate= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last= Rosasco | first= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
 
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
 
Within classification, several commonly used [[loss functions]] are written solely in terms of the product of the true label <math>y</math> and the predicted label <math>f(\vec{x})</math>. Therefore, they can be defined as functions of only one variable <math>\upsilon = y f(\vec{x})</math>, so that

:<math>V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon)</math>

with a suitably chosen function <math>\phi : \mathbb{R} \to \mathbb{R}</math>. These are called '''margin-based loss functions''', and choosing such a loss function amounts to choosing <math>\phi</math>. The selection of <math>\phi</math> impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk.
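The margin-based losses named in the figure above can be sketched numerically. The following is an illustrative Python sketch (the function names are ours, not a standard library API; the logistic loss is shown normalized so that <math>\phi(0)=1</math>):

```python
import math

# Margin-based losses phi(v), where v = y * f(x) is the margin.
# All function names below are illustrative, not from any library.

def zero_one(v):     # 0-1 loss expressed through the margin's sign
    return 1.0 if v <= 0 else 0.0

def hinge(v):        # max(0, 1 - v)
    return max(0.0, 1.0 - v)

def logistic(v):     # (1/ln 2) * ln(1 + e^{-v}), normalized so phi(0) = 1
    return math.log(1.0 + math.exp(-v)) / math.log(2.0)

def exponential(v):  # e^{-v}
    return math.exp(-v)

def square(v):       # (1 - v)^2
    return (1.0 - v) ** 2

# Each convex surrogate matches the 0-1 loss at the decision margin v = 0:
for phi in (hinge, logistic, exponential, square):
    assert phi(0.0) >= zero_one(0.0)  # all equal 1 at v = 0
```

Plotting these functions over <math>\upsilon \in [-2, 2]</math> reproduces the qualitative picture in the figure: all of the convex surrogates lie on or above the zero–one loss near the decision boundary.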
 
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) p(y\mid\vec{x})p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) p(-1\mid\vec{x})]p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) (1-p(1\mid\vec{x}))]p(\vec{x})\,d\vec{x}
\end{align}
</math>
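The second factorization shows that the risk can be minimized pointwise: for each fixed <math>\vec{x}</math> with <math>\eta = p(1\mid\vec{x})</math>, one minimizes the inner (conditional) risk <math>\phi(f)\eta + \phi(-f)(1-\eta)</math> over the scalar <math>f</math>. A minimal numerical sketch in Python (helper names are ours), using the square loss, whose conditional minimizer is known to be <math>f^* = 2\eta - 1</math>:

```python
# Conditional risk for a fixed input x with eta = p(1|x):
#   C(f) = phi(f) * eta + phi(-f) * (1 - eta)
# Illustrative sketch: minimize C by grid search for the square loss
# phi(v) = (1 - v)^2, whose exact minimizer is f* = 2*eta - 1.

def conditional_risk(f, eta, phi):
    return phi(f) * eta + phi(-f) * (1.0 - eta)

def square(v):
    return (1.0 - v) ** 2

eta = 0.8
grid = [i / 1000.0 for i in range(-1000, 1001)]  # candidate f in [-1, 1]
f_star = min(grid, key=lambda f: conditional_risk(f, eta, square))
assert abs(f_star - (2 * eta - 1)) < 1e-2        # grid minimizer near 0.6
```

The grid minimizer agrees with the closed-form value <math>2\eta - 1 = 0.6</math>, illustrating how the choice of <math>\phi</math> determines the optimal <math>f^{*}_\phi</math>.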
 
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is standard practice to substitute '''surrogate loss functions''' that are convex and smooth, and therefore tractable for commonly used learning algorithms. Beyond computational tractability, one can show that the solutions to the learning problem obtained with these surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci">{{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | accessdate= 6 December 2014}}</ref> Some of these surrogates are described below.
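This recovery property can be illustrated on a toy example: minimizing the (smooth, convex) logistic loss by plain gradient descent on a linearly separable one-dimensional sample yields a sign-based classifier that also attains zero 0–1 risk on that sample. This is only a sketch under those assumptions, not a general training procedure:

```python
import math

# Toy sketch: gradient descent on the logistic surrogate loss for a
# linear model f(x) = w * x over a separable 1-D sample. Names and
# data are illustrative only.

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]  # (x, y) pairs

w = 0.0
lr = 0.1
for _ in range(500):
    grad = 0.0
    for x, y in data:
        v = y * w * x                          # margin
        grad += -y * x / (1.0 + math.exp(v))   # d/dw of log(1 + e^{-v})
    w -= lr * grad / len(data)

# The surrogate minimizer also minimizes the empirical 0-1 risk here:
zero_one_risk = sum(1 for x, y in data if y * w * x <= 0) / len(data)
assert zero_one_risk == 0.0
```

Here the surrogate is minimized by standard convex optimization, yet the resulting <math>\operatorname{sgn}(f)</math> classifies every training point correctly, sidestepping the NP-hard combinatorial problem.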
 
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
== Hinge loss ==
{{main|Hinge loss}}
The hinge loss function is defined with <math>\phi(\upsilon) = \max(0, 1-\upsilon) = [1-\upsilon]_{+}</math>, where <math>[a]_{+} = \max(0,a)</math> is the [[positive part]] function.
Equivalently,

:<math>V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = [1 - yf(\vec{x})]_{+}.</math>
 
The hinge loss provides a relatively tight, convex upper bound on the 0–1 [[indicator function]]. Specifically, the hinge loss equals the 0–1 indicator function when <math>\operatorname{sgn}(f(\vec{x})) = y</math> and <math>|yf(\vec{x})| \geq 1</math>. In addition, empirical risk minimization with this loss is equivalent to the classical formulation of [[support vector machines]] (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized linearly, in proportion to their distance from the correct boundary.<ref name="Utah" />
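The per-point penalty behavior described above can be checked directly with a short illustrative sketch (the helper name is ours, not a library API):

```python
# Hinge loss V(f(x), y) = max(0, 1 - y * f(x)), illustrative sketch.

def hinge(f_x, y):
    return max(0.0, 1.0 - y * f_x)

# Correctly classified point outside the margin: no penalty.
assert hinge(2.0, 1) == 0.0
# Point inside the margin boundary y * f(x) = 1: linear penalty.
assert hinge(0.5, 1) == 0.5
# Misclassified point: penalty keeps growing linearly with distance.
assert hinge(-1.0, 1) == 2.0
```

The three cases correspond exactly to the SVM description above: zero loss outside the margin, then a penalty that grows linearly as a point moves inside the margin and past the separating hyperplane.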