Loss functions for classification

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).^[1] Given $X$ as the space of all possible inputs and $Y=\{-1,1\}$ as the set of all possible outputs, we wish to find a function $f:X\to \mathbb {R}$ which best maps ${\vec {x}}$ to $y$ .^[2] However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same ${\vec {x}}$ to generate different $y$ .^[3] As a result, the goal of the learning problem is to minimize expected risk, defined as

I[f]=\displaystyle \int _{X\times Y}V(f({\vec {x}}),y)p({\vec {x}},y)\,d{\vec {x}}\,dy

where $V(f({\vec {x}}),y)$ is the loss function, and $p({\vec {x}},y)$ is the probability density function of the process that generated the data, which can equivalently be written as

p({\vec {x}},y)=p(y\mid {\vec {x}})p({\vec {x}}).

In practice, the probability distribution $p({\vec {x}},y)$ is unknown. Consequently, utilizing a training set of $n$ independently and identically distributed sample points

S=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}

drawn from the data sample space, one seeks to minimize empirical risk

I_{S}[f]={\frac {1}{n}}\sum _{i=1}^{n}V(f({\vec {x}}_{i}),y_{i})

as a proxy for expected risk.^[3] (See statistical learning theory for a more detailed description.)
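The empirical risk above is simply an average of per-sample losses. The following is a minimal sketch of that computation; the toy sample `S`, the linear scorer `f`, and the quadratic placeholder loss are hypothetical choices made only for illustration.

```python
from typing import Callable, Sequence, Tuple

# One training point: (feature vector x, label y in {-1, +1}).
Sample = Tuple[Sequence[float], int]


def empirical_risk(
    f: Callable[[Sequence[float]], float],
    loss: Callable[[float, int], float],
    sample: Sequence[Sample],
) -> float:
    """Average of V(f(x_i), y_i) over the n points of the training set."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)


# Hypothetical toy data and scorer, not taken from the article.
S = [((1.0, 2.0), 1), ((-1.0, 0.5), -1), ((0.5, -1.0), 1), ((-2.0, -1.0), -1)]
f = lambda x: 0.5 * x[0] + 0.25 * x[1]
placeholder_loss = lambda fx, y: (1.0 - y * fx) ** 2

risk = empirical_risk(f, placeholder_loss, S)
print(risk)
```

Any other loss with the signature `loss(f(x), y)` can be passed in place of the placeholder.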

For computational ease, it is standard practice in classification to write loss functions as functions of only one variable, $\upsilon =yf({\vec {x}})$ , the product of the true label $y$ and the predicted value $f({\vec {x}})$ . Selection of a loss function within this framework

V(f({\vec {x}}),y)=\phi (yf({\vec {x}}))=\phi (\upsilon )

impacts the optimal $f_{\phi }^{*}$ which minimizes the expected risk. Loss functions in this form are known as margin losses.

Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0-1 loss function (0–1 indicator function), which takes the value of 0 if the predicted classification equals that of the true class or a 1 if the predicted classification does not match the true class. This selection is modeled by

V(f({\vec {x}}),y)=H(-yf({\vec {x}}))

where $H$ indicates the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an NP-hard combinatorial optimization problem.^[4] As a result, it is better to substitute continuous, convex loss function surrogates which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.^[5] Some of these surrogates are described below.
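As a rough illustration of why the 0–1 loss is hard to optimize directly, the sketch below evaluates $H(-yf({\vec {x}}))$ and shows that a small change in the score leaves the loss unchanged, so it carries no gradient information. The convention $H(0)=1$ (a score of exactly 0 counts as an error) and the example scores are assumptions of this sketch.

```python
from typing import Sequence


def zero_one_loss(fx: float, y: int) -> float:
    """0-1 loss H(-y * f(x)), assuming H(0) = 1 so that f(x) = 0 is an error."""
    return 1.0 if -y * fx >= 0 else 0.0


def error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical risk under the 0-1 loss, i.e. the misclassification rate."""
    return sum(zero_one_loss(s, y) for s, y in zip(scores, labels)) / len(labels)


# Hypothetical scores and labels, for illustration only.
scores = [2.0, -0.3, 0.7, -1.5]
labels = [1, 1, -1, -1]
rate = error_rate(scores, labels)

# Nudging a score without crossing zero does not change the loss at all.
unchanged = zero_one_loss(-0.3, 1) == zero_one_loss(-0.2, 1)
print(rate, unchanged)
```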

Bayes consistency

Utilizing Bayes' theorem, it can be shown that the optimal $f_{0/1}^{*}$ which minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is in the form of

f_{0/1}^{*}({\vec {x}})\;=\;{\begin{cases}\;\;\;1&{\text{if }}p(1\mid {\vec {x}})>p(-1\mid {\vec {x}})\\\;\;\;0&{\text{if }}p(1\mid {\vec {x}})=p(-1\mid {\vec {x}})\\-1&{\text{if }}p(1\mid {\vec {x}})<p(-1\mid {\vec {x}})\end{cases}}.

A loss function $\phi (yf({\vec {x}}))$ is said to be classification-calibrated or Bayes consistent if its optimal $f_{\phi }^{*}$ is such that $f_{0/1}^{*}({\vec {x}})=\operatorname {sgn} (f_{\phi }^{*}({\vec {x}}))$ and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function $f_{\phi }^{*}$ by directly minimizing the expected risk, without having to explicitly model the probability density functions. For convex $\phi (\upsilon )$ , it can be shown that $\phi (\upsilon )$ is Bayes consistent if and only if it is differentiable at 0 and $\phi '(0)<0$ .^[6]^[1] Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation:^[7]

$\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]$ ,

where $f(\eta ),(0\leq \eta \leq 1)$ is any invertible function such that $f^{-1}(-v)=1-f^{-1}(v)$ and $C(\eta )$ is any differentiable strictly concave function such that $C(\eta )=C(1-\eta )$ .
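To make the generating formula concrete, the sketch below plugs in one pair of functions that satisfies both symmetry conditions, $C(\eta )=4\eta (1-\eta )$ and $f^{-1}(v)=(v+1)/2$ , and checks numerically that the result coincides with $(1-v)^{2}$ on $[-1,1]$ . The specific pair and the helper names are choices made for this illustration.

```python
from typing import Callable


def generated_loss(
    v: float,
    C: Callable[[float], float],
    C_prime: Callable[[float], float],
    f_inv: Callable[[float], float],
) -> float:
    """phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v)) * C'[f^{-1}(v)]."""
    eta = f_inv(v)
    return C(eta) + (1.0 - eta) * C_prime(eta)


# Illustrative choice: C is strictly concave with C(eta) = C(1 - eta),
# and f_inv satisfies f_inv(-v) = 1 - f_inv(v).
C = lambda eta: 4.0 * eta * (1.0 - eta)
C_prime = lambda eta: 4.0 - 8.0 * eta
f_inv = lambda v: (v + 1.0) / 2.0

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
max_gap = max(abs(generated_loss(v, C, C_prime, f_inv) - (1.0 - v) ** 2) for v in grid)
print(max_gap)
```

Expanding the formula by hand gives $1-v^{2}-2v(1-v)=(1-v)^{2}$ , which is what the numerical check confirms.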

Simplifying expected risk for classification

Given the properties of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,

{\begin{aligned}I[f]&=\int _{X\times Y}V(f({\vec {x}}),y)p({\vec {x}},y)\,d{\vec {x}}\,dy\\[6pt]&=\int _{X}\int _{Y}\phi (yf({\vec {x}}))p(y\mid {\vec {x}})p({\vec {x}})\,dy\,d{\vec {x}}\\[6pt]&=\int _{X}[\phi (f({\vec {x}}))p(1\mid {\vec {x}})+\phi (-f({\vec {x}}))p(-1\mid {\vec {x}})]p({\vec {x}})\,d{\vec {x}}\\[6pt]&=\int _{X}[\phi (f({\vec {x}}))p(1\mid {\vec {x}})+\phi (-f({\vec {x}}))(1-p(1\mid {\vec {x}}))]p({\vec {x}})\,d{\vec {x}}\end{aligned}}

The second equality follows from the properties described above, namely the factorization $p({\vec {x}},y)=p(y\mid {\vec {x}})p({\vec {x}})$ and the margin form $V(f({\vec {x}}),y)=\phi (yf({\vec {x}}))$ . The third equality follows from the fact that 1 and −1 are the only possible values for $y$ , and the fourth because $p(-1\mid {\vec {x}})=1-p(1\mid {\vec {x}})$ . As a result, one can solve for the minimizers of $I[f]$ for any convex loss function with these properties by differentiating the last equality with respect to $f$ and setting the derivative equal to 0. Thus, minimizers for all of the loss function surrogates described below are easily obtained as functions of only $f({\vec {x}})$ and $p(1\mid {\vec {x}})$ .^[3]
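Because the integrand can be minimized separately for each ${\vec {x}}$ , the minimizer can also be found numerically. The sketch below does this with a ternary search, using the margin convention $\phi (yf)$ and writing $\eta$ for $p(1\mid {\vec {x}})$ ; the function names, the search interval, and the quadratic example loss $\phi (\upsilon )=(1-\upsilon )^{2}$ are assumptions of this illustration.

```python
from typing import Callable


def conditional_risk(phi: Callable[[float], float], f: float, eta: float) -> float:
    """Pointwise risk phi(f) * eta + phi(-f) * (1 - eta), with eta = p(1 | x)."""
    return phi(f) * eta + phi(-f) * (1.0 - eta)


def minimize_conditional_risk(
    phi: Callable[[float], float],
    eta: float,
    lo: float = -10.0,
    hi: float = 10.0,
    iters: int = 200,
) -> float:
    """Ternary search; relies on the conditional risk being convex in f."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if conditional_risk(phi, m1, eta) < conditional_risk(phi, m2, eta):
            hi = m2  # the minimum lies in [lo, m2]
        else:
            lo = m1  # the minimum lies in [m1, hi]
    return 0.5 * (lo + hi)


quadratic = lambda v: (1.0 - v) ** 2
f_star = minimize_conditional_risk(quadratic, eta=0.8)
print(f_star)  # should be close to the analytic value 2 * 0.8 - 1
```

Setting the derivative of the quadratic conditional risk to zero gives $f=2\eta -1$ , which the search should approximate.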

Square loss

While more commonly used in regression, the square loss function can be re-written as a function $\phi (yf({\vec {x}}))$ and utilized for classification. Defined as

V(f({\vec {x}}),y)=(1-yf({\vec {x}}))^{2}

the square loss function is both convex and smooth and matches the 0–1 indicator function when $yf({\vec {x}})=0$ and when $yf({\vec {x}})=1$ . However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions.^[1] In addition, functions which yield high values of $f({\vec {x}})$ for some ${\vec {x}}\in X$ will perform poorly with the square loss function, since high values of $yf({\vec {x}})$ will be penalized severely, regardless of whether the signs of $y$ and $f({\vec {x}})$ match.
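The last point can be seen with a two-line calculation. In the sketch below, with score values chosen only for illustration, a confidently correct prediction with margin 3 receives the same penalty as a wrong prediction with margin −1.

```python
def square_loss(fx: float, y: int) -> float:
    """Square loss in margin form: (1 - y * f(x)) ** 2."""
    return (1.0 - y * fx) ** 2


confident_correct = square_loss(3.0, 1)   # margin y * f(x) = 3
wrong = square_loss(-1.0, 1)              # margin y * f(x) = -1
print(confident_correct, wrong)
```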

A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.^[8]

The minimizer of $I[f]$ for the square loss function is

f_{\text{Square}}^{*}=2p(1\mid x)-1

This function notably equals $f^{*}$ for the 0–1 loss function when $p(1\mid x)=1$ or $p(1\mid x)=0$ , but predicts a value between the two classifications when the classification of ${\vec {x}}$ is not known with absolute certainty.

Hinge loss

The hinge loss function is defined as

V(f({\vec {x}}),y)=\max(0,1-yf({\vec {x}}))=|1-yf({\vec {x}})|_{+}.

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when $\operatorname {sgn} (f({\vec {x}}))=y$ and $|yf({\vec {x}})|\geq 1$ . In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.^[4]

While the hinge loss function is both convex and continuous, it is not smooth (that is, not differentiable) at $yf({\vec {x}})=1$ . Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at $yf({\vec {x}})=1$ , which allows for the utilization of subgradient descent methods.^[4] SVMs utilizing the hinge loss function can also be solved using quadratic programming.
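A minimal sketch of a single subgradient descent step is given below. It assumes a linear scorer $f({\vec {x}})=\langle w,{\vec {x}}\rangle$ without a bias term or regularizer; the helper names, the learning rate, and the example point are hypothetical. At the kink $yf({\vec {x}})=1$ the sketch picks the zero vector, which is one valid element of the subdifferential.

```python
from typing import List


def hinge_loss(fx: float, y: int) -> float:
    """Hinge loss max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)


def score(w: List[float], x: List[float]) -> float:
    """Linear scorer f(x) = <w, x> (an assumption of this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge_subgradient(w: List[float], x: List[float], y: int) -> List[float]:
    """One subgradient of the hinge loss with respect to w."""
    if y * score(w, x) < 1.0:
        return [-y * xi for xi in x]
    return [0.0 for _ in x]  # flat region; at the kink, zero is a valid choice


def subgradient_step(w: List[float], x: List[float], y: int, lr: float) -> List[float]:
    """Move w against the chosen subgradient with step size lr."""
    g = hinge_subgradient(w, x, y)
    return [wi - lr * gi for wi, gi in zip(w, g)]


w0 = [0.0, 0.0]
x, y = [1.0, -2.0], 1
w1 = subgradient_step(w0, x, y, lr=0.5)
print(hinge_loss(score(w0, x), y), hinge_loss(score(w1, x), y))
```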

The minimizer of $I[f]$ for the hinge loss function is

f_{\text{Hinge}}^{*}({\vec {x}})\;=\;{\begin{cases}1&{\text{if }}p(1\mid {\vec {x}})>p(-1\mid {\vec {x}})\\-1&{\text{if }}p(1\mid {\vec {x}})<p(-1\mid {\vec {x}})\end{cases}}

when $p(1\mid {\vec {x}})\neq 0.5$ , which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between expected risk and the sign of the hinge loss function.^[1]

Generalized smooth hinge loss

The generalized smooth hinge loss function with parameter $\alpha$ is defined as

f_{\alpha }^{*}(z)\;=\;{\begin{cases}{\frac {\alpha }{\alpha +1}}-z&{\text{if }}z\leq 0\\{\frac {1}{\alpha +1}}z^{\alpha +1}-z+{\frac {\alpha }{\alpha +1}}&{\text{if }}0<z<1\\0&{\text{if }}z\geq 1\end{cases}}

where

z=yf({\vec {x}}).

It is monotonically non-increasing in $z$ and reaches 0 when $z=1$ .
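The piecewise definition translates directly into code. The sketch below assumes the linear branch ${\frac {\alpha }{\alpha +1}}-z$ for $z\leq 0$ , which makes the pieces join continuously at $z=0$ and $z=1$ ; the function name and the test values are illustrative.

```python
def generalized_smooth_hinge(z: float, alpha: float) -> float:
    """Generalized smooth hinge loss as a function of the margin z = y * f(x).

    Assumes the branch alpha / (alpha + 1) - z for z <= 0.
    """
    if z <= 0.0:
        return alpha / (alpha + 1.0) - z
    if z < 1.0:
        return z ** (alpha + 1.0) / (alpha + 1.0) - z + alpha / (alpha + 1.0)
    return 0.0


# The loss should not increase as the margin grows, and should vanish for z >= 1.
margins = [-2.0 + 0.25 * k for k in range(17)]  # -2.0, -1.75, ..., 2.0
values = [generalized_smooth_hinge(z, alpha=1.0) for z in margins]
non_increasing = all(a >= b for a, b in zip(values, values[1:]))
print(non_increasing, generalized_smooth_hinge(0.5, alpha=1.0))
```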

Logistic loss

The logistic loss function is defined as

V(f({\vec {x}}),y)={\frac {1}{\ln 2}}\ln(1+e^{-yf({\vec {x}})})

This function displays a similar convergence rate to the hinge loss function, and since it is differentiable everywhere, gradient descent methods can be utilized. However, the logistic loss function does not assign zero penalty to any points. Instead, functions that correctly classify points with high confidence (i.e., with high values of $|f({\vec {x}})|$ ) are penalized less. This structure leads the logistic loss function to be sensitive to outliers in the data.

The minimizer of $I[f]$ for the logistic loss function is

f_{\text{Logistic}}^{*}=\ln \left({\frac {p(1\mid x)}{1-p(1\mid x)}}\right).

This function is undefined when $p(1\mid x)=1$ or $p(1\mid x)=0$ (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when $p(1\mid x)$ increases and equals 0 when $p(1\mid x)=0.5$ .^[3]
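The sketch below evaluates the logistic loss and its minimizer. Rewriting $\ln(1+e^{-\upsilon })$ as $\max(-\upsilon ,0)+\ln(1+e^{-|\upsilon |})$ is a standard way to avoid overflow for large negative margins; that rewriting and the function names are choices of this illustration.

```python
import math


def logistic_loss(fx: float, y: int) -> float:
    """(1 / ln 2) * ln(1 + exp(-y * f(x))), computed without overflow."""
    v = y * fx
    # ln(1 + e^{-v}) = max(-v, 0) + ln(1 + e^{-|v|}) holds for every real v.
    return (max(-v, 0.0) + math.log1p(math.exp(-abs(v)))) / math.log(2.0)


def logistic_minimizer(p: float) -> float:
    """Log-odds ln(p / (1 - p)); only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(fx: float) -> float:
    """Inverse of the log-odds map."""
    return 1.0 / (1.0 + math.exp(-fx))


print(logistic_loss(0.0, 1))             # loss at the decision boundary
print(logistic_minimizer(0.5))           # zero when both classes are equally likely
print(sigmoid(logistic_minimizer(0.8)))  # maps back to the class probability
```

Applying the sigmoid to the minimizer recovers $p(1\mid x)$ , which is why the logistic loss is often used when class probability estimates are needed.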

Cross entropy loss (log loss)

Using the alternative label convention $t=(1+y)/2$ so that $t\in \{0,1\}$ , the binary cross entropy loss is defined as

V(f({\vec {x}}),t)=-t\ln(\sigma ({\vec {x}}))-(1-t)\ln(1-\sigma ({\vec {x}}))

where we introduced the logistic sigmoid:

\sigma ({\vec {x}})={\frac {1}{1+e^{-f({\vec {x}})}}}

It is easy to check that the logistic loss (above) and the binary cross entropy loss are in fact the same (up to a multiplicative constant $1/\ln 2$ ).
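This equivalence can be checked numerically. The sketch below compares the binary cross entropy with $\ln 2$ times the logistic loss over a few scores and both labels; the function names and the chosen scores are illustrative.

```python
import math


def sigmoid(fx: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-f(x)))."""
    return 1.0 / (1.0 + math.exp(-fx))


def binary_cross_entropy(fx: float, t: int) -> float:
    """-t * ln(sigma) - (1 - t) * ln(1 - sigma), with t in {0, 1}."""
    s = sigmoid(fx)
    return -t * math.log(s) - (1 - t) * math.log(1.0 - s)


def logistic_loss(fx: float, y: int) -> float:
    """(1 / ln 2) * ln(1 + exp(-y * f(x))), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * fx)) / math.log(2.0)


max_gap = 0.0
for fx in (-2.0, -0.5, 0.0, 1.5):
    for y in (-1, 1):
        t = (1 + y) // 2  # label convention t = (1 + y) / 2
        gap = abs(binary_cross_entropy(fx, t) - math.log(2.0) * logistic_loss(fx, y))
        max_gap = max(max_gap, gap)
print(max_gap)
```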

The cross entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. This function is not naturally represented as a product of the true label and the predicted value, but is convex and can be minimized using stochastic gradient descent methods. The cross entropy loss is ubiquitous in modern deep neural networks.

Exponential loss

The exponential loss function is defined as

V(f({\vec {x}}),y)=e^{-\beta yf({\vec {x}})}

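where $\beta >0$ is a scale parameter. It penalizes incorrect predictions more heavily than the hinge loss, since both the penalty and its gradient grow exponentially rather than linearly as $yf({\vec {x}})$ becomes more negative.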
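The comparison with the hinge loss can be made concrete. The sketch below assumes $\beta =1$ (for which $e^{-\upsilon }\geq 1-\upsilon$ holds for every margin $\upsilon$ ) and evaluates both losses at an illustrative misclassified point with margin −3.

```python
import math


def exponential_loss(fx: float, y: int, beta: float = 1.0) -> float:
    """Exponential loss exp(-beta * y * f(x)); beta = 1 is an assumed default."""
    return math.exp(-beta * y * fx)


def hinge_loss(fx: float, y: int) -> float:
    """Hinge loss max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)


# A misclassified point with margin y * f(x) = -3.
exp_penalty = exponential_loss(-3.0, 1)
hinge_penalty = hinge_loss(-3.0, 1)

# With beta = 1 the exponential loss is never below the hinge loss.
margins = [-3.0 + 0.5 * k for k in range(13)]  # -3.0, -2.5, ..., 3.0
dominates = all(exponential_loss(v, 1) >= hinge_loss(v, 1) for v in margins)
print(exp_penalty, hinge_penalty, dominates)
```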

References

[1] Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510.
[2] Shen, Yi (2005), Loss Functions For Binary Classification and Class Probability Estimation (PDF), University of Pennsylvania, retrieved 6 December 2014
[3] Rosasco, Lorenzo; Poggio, Tomaso (2014), A Regularization Tour of Machine Learning, MIT-9.520 Lectures Notes, vol. Manuscript
[4] Piyush, Rai (13 September 2011), Support Vector Machines (Contd.), Classification Loss Functions and Regularizers (PDF), Utah CS5350/6350: Machine Learning, retrieved 6 December 2014
[5] Ramanan, Deva (27 February 2008), Lecture 14 (PDF), UCI ICS273A: Machine Learning, retrieved 6 December 2014
[6] Bartlett, Peter L.; Jordan, Michael I.; Mcauliffe, Jon D. (2006). "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association. 101 (473): 138–156. ISSN 0162-1459.
[7] Masnadi-Shirazi, Hamed; Vasconcelos, Nuno, On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost (PDF), Statistical Visual Computing Laboratory, University of California, San Diego, retrieved 6 December 2014
[8] Rifkin, Ryan M.; Lippert, Ross A. (1 May 2007), Notes on Regularized Least Squares (PDF), MIT Computer Science and Artificial Intelligence Laboratory
