</math>. This holds even for nonconvex loss functions, which means that gradient descent based algorithms such as [[Gradient boosting|gradient boosting]] can be used to construct the minimizer in practice with finite training samples.
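As a rough illustration of this idea, the following minimal sketch minimizes the empirical risk of a smooth surrogate (here the logistic loss) over a linear model by plain gradient descent; the synthetic data, model class, step size, and iteration count are illustrative assumptions rather than part of the article.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative synthetic data: rows of X are feature vectors, y holds labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))

w = np.zeros(2)                       # linear model f(x) = w . x
step = 0.1
for _ in range(500):
    margins = y * (X @ w)             # margins y_i f(x_i)
    # Gradient of the empirical logistic risk (1/n) sum_i log(1 + exp(-y_i f(x_i)))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
    w -= step * grad                  # gradient descent step

print("learned weights:", w)
</syntaxhighlight>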
== Square loss ==
While more commonly used in regression, the square loss function can be re-written as a function <math>\phi(yf(\vec{x}))</math> and utilized for classification. It can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))\,C'[f^{-1}(v)] = 4\left(\frac{1}{2}(v+1)\right)\left(1-\frac{1}{2}(v+1)\right)+\left(1-\frac{1}{2}(v+1)\right)\left(4-8\left(\frac{1}{2}(v+1)\right)\right)=(1-v)^2.</math>
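The algebra above can be checked symbolically; the following minimal sketch assumes the quantities implied by the expansion, namely <math>C(\eta)=4\eta(1-\eta)</math>, <math>C'(\eta)=4-8\eta</math> and <math>f^{-1}(v)=\tfrac{1}{2}(v+1)</math>.

<syntaxhighlight lang="python">
import sympy as sp

v = sp.symbols('v')
eta = (v + 1) / 2                      # f^{-1}(v) for the square loss
C = 4 * eta * (1 - eta)                # C(eta) = 4*eta*(1 - eta)
Cprime = 4 - 8 * eta                   # C'(eta)
phi = C + (1 - eta) * Cprime           # generating formula (2)
print(sp.simplify(phi - (1 - v) ** 2)) # prints 0, i.e. phi(v) = (1 - v)^2
</syntaxhighlight>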
 
The square loss function is both convex and smooth and matches the 0–1 [[indicator function]] when <math>yf(\vec{x})= 0</math> and when <math>yf(\vec{x}) = 1</math>. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions.<ref name="mit" /> In addition, functions which yield high values of <math>f(\vec{x})</math> for some <math>x \in X</math> will perform poorly with the square loss function, since high values of <math>yf(\vec{x})</math> will be penalized severely, regardless of whether the signs of <math>y</math> and <math>f(\vec{x})</math> match.
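A small numeric sketch of this effect, comparing the square loss with the hinge loss and the (natural-log) logistic loss at a few arbitrary illustrative margins <math>v=yf(\vec{x})</math>:

<syntaxhighlight lang="python">
import numpy as np

margins = np.array([-2.0, 0.0, 1.0, 5.0])       # illustrative values of v = y f(x)
square = (1 - margins) ** 2
hinge = np.maximum(0.0, 1 - margins)
logistic = np.log(1 + np.exp(-margins))         # natural-log logistic loss

for m, sq, hg, lg in zip(margins, square, hinge, logistic):
    print(f"v = {m:+.1f}   square = {sq:6.2f}   hinge = {hg:4.2f}   logistic = {lg:6.4f}")
</syntaxhighlight>

At <math>v=5</math>, a confidently correct prediction, the square loss is 16 while the hinge and logistic losses are essentially zero, which is the sense in which high values of <math>yf(\vec{x})</math> are penalized.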
 
A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for [[Tikhonov regularization]], one can solve for the regularization parameter using leave-one-out [[cross-validation (statistics) |cross-validation]] in the same time as it would take to solve a single problem.<ref>{{Citation| last= Rifkin| first= Ryan M.| last2= Lippert| first2= Ross A.| title= Notes on Regularized Least Squares| publisher= MIT Computer Science and Artificial Intelligence Laboratory| date= 1 May 2007|url=https://dspace.mit.edu/bitstream/handle/1721.1/37318/MIT-CSAIL-TR-2007-025.pdf?sequence=1}}</ref>
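A minimal sketch of this shortcut for Tikhonov-regularized least squares (ridge regression), using the closed-form leave-one-out residuals <math>(y_i-\hat y_i)/(1-H_{ii})</math> of the hat matrix <math>H=X(X^\mathsf{T}X+\lambda I)^{-1}X^\mathsf{T}</math>; the data and the grid of <math>\lambda</math> values are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=100)

def loo_error(lam):
    # Hat matrix of ridge regression: H = X (X^T X + lam * I)^{-1} X^T
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    residuals = y - H @ y
    # Leave-one-out residuals in closed form: r_i / (1 - H_ii)
    return np.mean((residuals / (1 - np.diag(H))) ** 2)

for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, loo_error(lam))
</syntaxhighlight>

Each candidate <math>\lambda</math> costs only one linear solve, rather than <math>n</math> separate refits.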
 
The minimizer of <math>I[f]</math> for the square loss function is
:<math>f^*_\text{Square}= 2\eta-1=2p(1\mid x)-1</math>
This function notably equals <math>f^*</math> for the 0–1 loss function when <math>p(1\mid x)=1</math> or <math>p(1\mid x)=0</math>, but predicts a value between the two classifications when the classification of <math>\vec{x}</math> is not known with absolute certainty.
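This expression can be derived by minimizing the conditional expected risk of the square loss pointwise; writing <math>\eta=p(1\mid x)</math> and <math>f=f(\vec{x})</math>,
:<math>\frac{\partial}{\partial f}\Bigl[\eta(1-f)^2+(1-\eta)(1+f)^2\Bigr]=-2\eta(1-f)+2(1-\eta)(1+f)=0 \quad\Longrightarrow\quad f=2\eta-1.</math>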
 
== Hinge loss ==
The hinge loss provides a relatively tight, convex upper bound on the 0–1 [[indicator function]]. Specifically, the hinge loss equals the 0–1 [[indicator function]] when <math>\operatorname{sgn}(f(\vec{x})) = y</math> and <math>|yf(\vec{x})| \geq 1</math>. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for [[support vector machines]] (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized linearly in proportion to their distance from the correct boundary.<ref name="Utah" />
 
While the hinge loss function is both convex and continuous, it is not smooth (that is, not differentiable) at <math>yf(\vec{x})=1</math>. Consequently, the hinge loss function cannot be used with [[gradient descent]] methods or [[stochastic gradient descent]] methods which rely on differentiability over the entire ___domain. However, the hinge loss does have a subgradient at <math>yf(\vec{x})=1</math>, which allows for the utilization of [[subgradient method|subgradient descent methods]].<ref name="Utah" /> SVMs utilizing the hinge loss function can also be solved using [[quadratic programming]].
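A minimal sketch of such a subgradient method for a linear model without a regularization term (the synthetic data, step size, and iteration count are illustrative assumptions; a full SVM would add a norm penalty on <math>w</math>):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.zeros(2)                                  # linear model f(x) = w . x
step = 0.05
for _ in range(1000):
    margins = y * (X @ w)
    # Subgradient of the average hinge loss: -y_i x_i where y_i f(x_i) < 1, else 0
    # (at y_i f(x_i) = 1 the loss is not differentiable, but 0 is a valid subgradient there)
    active = margins < 1
    subgrad = -(X[active].T @ y[active]) / len(y)
    w -= step * subgrad

print("average hinge loss:", np.mean(np.maximum(0.0, 1 - y * (X @ w))))
</syntaxhighlight>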
 
The minimizer of <math>I[f]</math> for the hinge loss function is
:<math>f^*_\text{Hinge}(\vec{x}) \;=\; \begin{cases} 1& \text{if }p(1\mid\vec{x}) > p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>
 
when <math>p(1\mid x) \ne 0.5</math>, which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between the expected risk and the sign of the hinge loss function.<ref name="mit" /> The hinge loss cannot be derived from (2) since <math>f^*_{\text{Hinge}}</math> is not invertible.
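The form of <math>f^*_\text{Hinge}</math> follows from minimizing the conditional risk pointwise. Writing <math>\eta=p(1\mid\vec{x})</math>, the conditional risk of the hinge loss is
:<math>\eta\max(0,1-f)+(1-\eta)\max(0,1+f),</math>
which for <math>f\in[-1,1]</math> equals <math>1+(1-2\eta)f</math>; values of <math>f</math> outside <math>[-1,1]</math> do not decrease the risk, so the minimum is attained at <math>f=1</math> when <math>\eta>\tfrac{1}{2}</math> and at <math>f=-1</math> when <math>\eta<\tfrac{1}{2}</math>.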
 
== Generalized Smooth Hinge loss ==