Loss functions for classification
{{Short description|Concept in machine learning}}
[[File:Loss function surrogates.svg|thumb|Plot of various functions. Blue is the 0–1 indicator function. Green is the square loss function. Purple is the hinge loss function. Yellow is the logistic loss function. Note that all surrogates give a loss penalty of 1 for {{math|''y'' {{=}} ''f''(''x'') {{=}} 0}}.]]
{{Machine learning}}
[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 | s2cid = 11845688 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{ -1,1 \}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathcal{Y}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | access-date= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last1= Rosasco | first1= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
 
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
 
Within classification, several commonly used [[loss functions]] are written solely in terms of the product of the true label <math>y</math> and the predicted label <math>f(\vec{x})</math>. Therefore, they can be defined as functions of only one variable <math>\upsilon=y f(\vec{x})</math>, so that <math>V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon)</math> with a suitably chosen function <math>\phi:\mathbb{R}\to\mathbb{R}</math>. These are called '''margin-based loss functions'''. Choosing a margin-based loss function amounts to choosing <math>\phi</math>. Selection of a loss function within this framework impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk, see [[empirical risk minimization]].
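The following minimal Python sketch (illustrative only; the particular set of surrogates and the evaluation grid are choices made here, not taken from the cited sources) writes several common surrogates as functions of the single margin variable <math>\upsilon=y f(\vec{x})</math>. Note that each gives a penalty of 1 at <math>\upsilon = 0</math>, as in the plot above.

<syntaxhighlight lang="python">
import numpy as np

# A minimal sketch: common margin-based losses phi(v), with v = y * f(x).
def hinge(v):        return np.maximum(0.0, 1.0 - v)
def logistic(v):     return np.log1p(np.exp(-v)) / np.log(2.0)
def square(v):       return (1.0 - v) ** 2
def exponential(v):  return np.exp(-v)

v = np.linspace(-2.0, 2.0, 5)
for name, phi in [("hinge", hinge), ("logistic", logistic),
                  ("square", square), ("exponential", exponential)]:
    print(name, np.round(phi(v), 3))   # every phi gives phi(0) = 1
</syntaxhighlight>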
 
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
 
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})]\, p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))]\, p(\vec{x})\,d\vec{x}
\end{align}
</math>
 
The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for <math>y</math>, and the fourth because <math>p(-1\mid x)=1-p(1\mid x)</math>. The term within brackets <math>
[\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) (1-p(1\mid\vec{x}))]
</math> is known as the ''conditional risk.''
 
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
 
:<math>\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)</math>
 
where <math>
\eta=p(y=1|\vec{x})
</math>, which is also equivalent to setting the derivative of the conditional risk equal to zero.
 
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x}))</math>
 
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= https://cis.temple.edu/~latecki/Courses/AI-Fall12/Lectures/SVM.pdf | access-date= 4 May 2021}}</ref> As a result, it is better to substitute '''loss function surrogates''' which are tractable for commonly used learning algorithms, as they have convenient properties such as being convex and smooth. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci">{{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | access-date= 6 December 2014}}</ref> Some of these surrogates are described below.
 
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
 
:<math>S = \{(\vec{x}_1,y_1), \dots ,(\vec{x}_n,y_n)\}</math>
 
drawn from the data [[sample space]], one seeks to [[empirical risk minimization | minimize empirical risk]]
 
:<math>I_S[f] = \frac{1}{n} \sum_{i=1}^n V( f(\vec{x}_i),y_i)</math>
as a proxy for expected risk.<ref name="mitlec" /> (See [[statistical learning theory]] for a more detailed description.)
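A minimal sketch of this proxy, assuming toy Gaussian data and a fixed linear scorer <math>f(\vec{x}) = \vec{w}\cdot\vec{x}</math> (both illustrative choices, not from the cited lectures):

<syntaxhighlight lang="python">
import numpy as np

# Sketch: empirical risk I_S[f] on n i.i.d. sample points, using the
# hinge surrogate and a fixed (not learned) linear scorer f(x) = w . x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # n = 100 inputs
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))   # noisy labels in {-1, 1}
w = np.array([1.0, 0.0])

margins = y * (X @ w)                               # v_i = y_i * f(x_i)
print(np.mean(np.maximum(0.0, 1.0 - margins)))      # I_S[f]
</syntaxhighlight>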
 
==Bayes consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, i.e., the one that minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is in the form of
 
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
 
A loss function is said to be ''classification-calibrated'' or ''Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math> and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions.
 
For convex margin loss <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)<0</math>.<ref>{{Cite journal|last1=Bartlett|first1=Peter L.|last2=Jordan|first2=Michael I.|last3=Mcauliffe|first3=Jon D.|date=2006|title=Convexity, Classification, and Risk Bounds|journal=Journal of the American Statistical Association|volume=101|issue=473|pages=138–156|issn=0162-1459|jstor=30047445|doi=10.1198/016214505000000907|s2cid=2833811}}</ref><ref name="mit" /> Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation <ref name=":0">{{Cite journal|last1=Masnadi-Shirazi|first1=Hamed|last2=Vasconcelos|first2=Nuno|date=2008|title=On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost|url=https://papers.nips.cc/paper/3591-on-the-design-of-loss-functions-for-classification-theory-robustness-to-outliers-and-savageboost.pdf|journal=Proceedings of the 21st International Conference on Neural Information Processing Systems|series=NIPS'08|___location=USA|publisher=Curran Associates Inc.|pages=1049–1056|isbn=9781605609492}}</ref>
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
 
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite book|last1=Leistner|first1=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|title=2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops |chapter=On robustness of on-line boosting - a competitive study |date=September 2009|pages=1362–1369|doi=10.1109/ICCVW.2009.5457451|isbn=978-1-4244-4442-7|s2cid=6032045}}</ref> For all loss functions generated from (2), the posterior probability <math>p(y=1|\vec{x})</math> can be found using the invertible ''link function'' as <math>p(y=1|\vec{x})=\eta=f^{-1}(v)</math>. Such loss functions where the posterior probability can be recovered using the invertible link are called ''proper loss functions''.
{| class="wikitable"
|+Table-I
!Loss name
!<math>\phi(v)</math>
!<math>C(\eta)</math>
!<math>f^{-1}(v)</math>
!<math>f(\eta)</math>
|-
|Exponential
|<math>e^{-v}</math>
|<math>2\sqrt{\eta(1-\eta)}</math>
|<math>\frac{e^{2v}}{1+e^{2v}}</math>
|<math>\frac{1}{2}\log(\frac{\eta}{1-\eta})</math>
|-
|Logistic
|<math>\frac{1}{\log(2)}\log(1+e^{-v})</math>
|<math>\frac{1}{\log(2)}[-\eta\log(\eta)-(1-\eta)\log(1-\eta)]</math>
|<math>\frac{e^v}{1+e^v}</math>
|<math>\log(\frac{\eta}{1-\eta})</math>
|-
|Square
|<math>(1-v)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\frac{1}{2}(v+1)</math>
|<math>2\eta-1</math>
|-
|Savage
|<math>\frac{1}{(1+e^v)^2}</math>
|<math>\eta(1-\eta)</math>
|<math>\frac{e^v}{1+e^v}</math>
|<math>\log(\frac{\eta}{1-\eta})</math>
|-
|Tangent
|<math>(2\arctan(v)-1)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\arctan(v)+\frac{1}{2}</math>
|<math>\tan(\eta-\frac{1}{2})</math>
|}

The sole minimizer of the expected risk, <math>f^*_{\phi}</math>, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding <math>f(\eta)</math>. This holds even for the nonconvex loss functions, which means that gradient descent based algorithms such as [[gradient boosting]] can be used to construct the minimizer.
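Equation (2) lends itself to a direct numerical check. The sketch below (an illustration, with the Savage row of Table-I hard-coded) generates <math>\phi(v)</math> from <math>C(\eta)</math> and <math>f^{-1}(v)</math> and compares it with the tabulated closed form <math>1/(1+e^v)^2</math>.

<syntaxhighlight lang="python">
import numpy as np

# Sketch: construct phi(v) from equation (2) using the Savage entries of
# Table-I, then compare with the tabulated closed form.
C       = lambda eta: eta * (1.0 - eta)
C_prime = lambda eta: 1.0 - 2.0 * eta
f_inv   = lambda v: 1.0 / (1.0 + np.exp(-v))    # e^v / (1 + e^v)

def phi(v):
    eta = f_inv(v)
    return C(eta) + (1.0 - eta) * C_prime(eta)  # equation (2)

v = np.linspace(-3.0, 3.0, 7)
print(np.allclose(phi(v), 1.0 / (1.0 + np.exp(v)) ** 2))  # True
</syntaxhighlight>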
 
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>)]]
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last1=Vasconcelos|first1=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where smaller <math>0<\gamma<1</math> increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in [[gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes it clear that when a learning rate of <math>\gamma</math> is used, the correct formula for retrieving the posterior probability is now <math>\eta=f^{-1}(\gamma F(x))</math>.
 
In conclusion, by choosing a loss function with larger margin (smaller <math>\gamma</math>) we increase regularization and improve our estimates of the posterior probability which in turn improves the ROC curve of the final classifier.
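A short sketch of the <math>\gamma</math>-parametrized logistic loss described above (the score fed through the link is an illustrative value): smaller <math>\gamma</math> flattens the loss, and the posterior is read back as <math>\eta=f^{-1}(\gamma F(x))</math>.

<syntaxhighlight lang="python">
import numpy as np

# Sketch: logistic loss with margin parameter gamma; smaller gamma in
# (0, 1) increases the loss margin and strengthens regularization.
def logistic_gamma(v, gamma):
    return np.log1p(np.exp(-gamma * v)) / gamma

f_inv = lambda v: 1.0 / (1.0 + np.exp(-v))   # logistic link inverse

F_x = 2.0                                    # illustrative boosted score
for gamma in (1.0, 0.5, 0.2):
    # loss at margin -1, and the posterior recovered as f^{-1}(gamma * F)
    print(gamma, logistic_gamma(-1.0, gamma), f_inv(gamma * F_x))
</syntaxhighlight>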
 
==Square loss==
While more commonly used in regression, the square loss function can be re-written as a function <math>\phi(yf(\vec{x}))</math> and utilized for classification. It can be generated using (2) and Table-I as follows
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 4(\frac{1}{2}(v+1))(1-\frac{1}{2}(v+1))+(1-\frac{1}{2}(v+1))(4-8(\frac{1}{2}(v+1)))=(1-v)^2.</math>
 
The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions.<ref name="mit" /> In addition, functions which yield high values of <math>f(\vec{x})</math> for some <math>\vec{x} \in \mathcal{X}</math> will perform poorly with the square loss function, since high values of <math>yf(\vec{x})</math> will be penalized severely, regardless of whether the signs of <math>y</math> and <math>f(\vec{x})</math> match.
 
A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for [[Tikhonov regularization]], one can solve for the regularization parameter using leave-one-out [[cross-validation (statistics)|cross-validation]] in the same time as it would take to solve a single problem.<ref>{{Citation| last1= Rifkin| first1= Ryan M.| last2= Lippert| first2= Ross A.| title= Notes on Regularized Least Squares| publisher= MIT Computer Science and Artificial Intelligence Laboratory| date= 1 May 2007|url=https://dspace.mit.edu/bitstream/handle/1721.1/37318/MIT-CSAIL-TR-2007-025.pdf?sequence=1}}</ref>
 
The minimizer of <math>I[f]</math> for the square loss function can be directly found from equation (1) as
 
:<math>f^*_\text{Square}= 2\eta-1=2p(1\mid x)-1.</math>
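This closed form can be confirmed numerically; in the sketch below the grid of <math>\eta</math> values is an arbitrary choice.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: the conditional-risk minimizer for the square loss is 2*eta - 1.
phi = lambda v: (1.0 - v) ** 2
for eta in (0.2, 0.5, 0.9):
    risk = lambda f: phi(f) * eta + phi(-f) * (1.0 - eta)
    res = minimize_scalar(risk, bounds=(-2.0, 2.0), method="bounded")
    print(eta, round(res.x, 4), 2.0 * eta - 1.0)
</syntaxhighlight>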
 
== Logistic loss ==
The logistic loss function can be generated using (2) and Table-I as follows
 
:<math>\begin{align}
\phi(v) &= C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)\, C'\left[f^{-1}(v)\right] \\
&= \frac{1}{\log(2)}\left [\frac{-e^v}{1+e^v}\log\frac{e^v}{1+e^v}-\left(1-\frac{e^v}{1+e^v}\right)\log\left(1-\frac{e^v}{1+e^v}\right)\right ]+\left(1-\frac{e^v}{1+e^v}\right) \left [\frac{-1}{\log(2)}\log\left(\frac{\frac{e^v}{1+e^v}}{1-\frac{e^v}{1+e^v}}\right)\right] \\
&=\frac{1}{\log(2)}\log(1+e^{-v}).
\end{align}
</math>
 
The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the [[LogitBoost|LogitBoost algorithm]].
 
The minimizer of <math>I[f]</math> for the logistic loss function can be directly found from equation (1) as
 
:<math>f^*_\text{Logistic}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>
 
This function is undefined when <math>p(1\mid x)=1</math> or <math>p(1\mid x)=0</math> (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when <math>p(1\mid x)</math> increases and equals 0 when <math>p(1\mid x)= 0.5</math>.<ref name="mitlec" />
 
The logistic loss and the binary [[cross-entropy]] loss (log loss) are in fact the same (up to a multiplicative constant <math>\frac{1}{\log(2)}</math>). The cross-entropy loss is closely related to the [[Kullback–Leibler divergence]] between the empirical distribution and the predicted distribution. The cross-entropy loss is ubiquitous in modern [[deep learning|deep neural networks]].
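This equivalence is easy to verify numerically; in the sketch below the score and label are arbitrary illustrative values.

<syntaxhighlight lang="python">
import numpy as np

# Sketch: the logistic margin loss equals the binary cross-entropy of
# the sigmoid output sigma(f), up to the constant factor 1/log(2).
sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))

f, y = 1.3, -1                         # illustrative score and label
t = (1 + y) / 2                        # relabel {-1, 1} as {0, 1}
ce = -(t * np.log(sigmoid(f)) + (1 - t) * np.log(1.0 - sigmoid(f)))
logistic = np.log1p(np.exp(-y * f)) / np.log(2.0)
print(np.isclose(logistic, ce / np.log(2.0)))   # True
</syntaxhighlight>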
 
== Exponential loss ==
The exponential loss function can be generated using (2) and Table-I as follows
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}\right) = e^{-v}</math>
 
The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponentially weighted 0–1 loss is used in the [[AdaBoost|AdaBoost algorithm]], implicitly giving rise to the exponential loss.
 
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
 
:<math>f^*_\text{Exp}= \frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>
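Because the exponential loss is proper, the posterior can be read back through the link <math>\eta = e^{2v}/(1+e^{2v})</math> from Table-I. A short sketch with an arbitrary illustrative <math>\eta</math>:

<syntaxhighlight lang="python">
import numpy as np

# Sketch: the exponential-loss minimizer is half the log-odds, and the
# Table-I link inverse recovers the posterior from it.
eta = 0.75                                       # assumed p(y=1|x)
f_star = 0.5 * np.log(eta / (1.0 - eta))         # minimizer from eq. (1)
eta_back = np.exp(2 * f_star) / (1.0 + np.exp(2 * f_star))
print(f_star, eta_back)                          # 0.5493..., 0.75
</syntaxhighlight>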
 
== Savage loss ==
The Savage loss<ref name=":0" /> can be generated using (2) and Table-I as follows
 
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = \left(\frac{e^v}{1+e^v}\right)\left(1-\frac{e^v}{1+e^v}\right)+\left(1-\frac{e^v}{1+e^v}\right)\left(1-\frac{2e^v}{1+e^v}\right) = \frac{1}{(1+e^v)^2}.</math>
 
The Savage loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. The Savage loss has been used in [[gradient boosting]] and the SavageBoost algorithm.
 
The minimizer of <math>I[f]</math> for the Savage loss function can be directly found from equation (1) as
 
:<math>f^*_\text{Savage}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>
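The robustness claim can be illustrated by comparing penalties at increasingly negative margins (the margin values below are arbitrary): the Savage loss saturates near 1 while the exponential loss grows without bound.

<syntaxhighlight lang="python">
import numpy as np

# Sketch: bounded Savage penalty vs. exponentially growing penalty at
# large negative margins, i.e. badly misclassified (outlier) points.
savage = lambda v: 1.0 / (1.0 + np.exp(v)) ** 2
expo   = lambda v: np.exp(-v)

for v in (-1.0, -5.0, -10.0):
    print(v, round(savage(v), 4), round(expo(v), 1))
</syntaxhighlight>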
 
== Tangent loss ==
The Tangent loss<ref>{{Cite book|last1=Masnadi-Shirazi|first1=H.|last2=Mahadevan|first2=V.|last3=Vasconcelos|first3=N.|title=2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition |chapter=On the design of robust classifiers for computer vision |date=June 2010|pages=779–786|doi=10.1109/CVPR.2010.5540136|citeseerx=10.1.1.172.6416|isbn=978-1-4244-6984-0|s2cid=632758}}</ref> can be generated using (2) and Table-I as follows
 
:<math>
\begin{align}
\phi(v) & = C[f^{-1}(v)]+\left( 1-f^{-1}(v)\right) C'[f^{-1}(v)]
\\ & = 4 \left( \arctan(v)+\frac{1}{2} \right) \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) + \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) \left( 4-8 \left( \arctan(v)+\frac{1}{2} \right) \right) \\
& = \left( 2\arctan(v)-1 \right) ^2.
\end{align}
</math>
 
The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in [[gradient boosting]], the TangentBoost algorithm and Alternating Decision Forests.<ref>{{Cite book|last1=Schulter|first1=S.|last2=Wohlhart|first2=P.|last3=Leistner|first3=C.|last4=Saffari|first4=A.|last5=Roth|first5=P. M.|last6=Bischof|first6=H.|title=2013 IEEE Conference on Computer Vision and Pattern Recognition |chapter=Alternating Decision Forests |date=June 2013|pages=508–515|doi=10.1109/CVPR.2013.72|citeseerx=10.1.1.301.1305|isbn=978-0-7695-4989-7|s2cid=6557162}}</ref>
 
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
 
:<math>f^*_\text{Tangent}= \tan \left( \eta-\frac{1}{2} \right) =\tan \left( p \left( 1\mid x \right) -\frac{1}{2}\right) .</math>
 
== Hinge loss ==
{{main|Hinge loss}}
The hinge loss function is defined with <math>\phi(\upsilon) = \max(0, 1-\upsilon) = [1-\upsilon]_{+}</math>, where <math>[a]_{+} = \max(0,a)</math> is the [[positive part]] function.
 
:<math>V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = [1 - yf(\vec{x}) ]_{+}.</math>
 
The hinge loss provides a relatively tight, convex upper bound on the 0–1 [[indicator function]]. Specifically, the hinge loss equals the 0–1 [[indicator function]] when <math>\operatorname{sgn}(f(\vec{x})) = y</math> and <math>|yf(\vec{x})| \geq 1</math>. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for [[support vector machines]] (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.<ref name="Utah" />
 
While the hinge loss function is both convex and continuous, it is not smooth (is not differentiable) at <math>yf(\vec{x})=1</math>. Consequently, the hinge loss function cannot be used with [[gradient descent]] methods or [[stochastic gradient descent]] methods which rely on differentiability over the entire ___domain. However, the hinge loss does have a subgradient at <math>yf(\vec{x})=1</math>, which allows for the utilization of [[subgradient method|subgradient descent methods]].<ref name="Utah" /> SVMs utilizing the hinge loss function can also be solved using [[quadratic programming]].
 
The minimizer of <math>I[f]</math> for the hinge loss function is
 
:<math>f^*_\text{Hinge}(\vec{x}) \;=\; \begin{cases} 1& \text{if }p(1\mid\vec{x}) > p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>
 
when <math>p(1\mid x) \ne 0.5</math>, which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between expected risk and the sign of hinge loss function.<ref name="mit" /> The Hinge loss cannot be derived from (2) since <math>f^*_{\text{Hinge}}</math> is not invertible.
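A minimal sketch of subgradient descent on the empirical hinge risk for a linear classifier (toy data and a fixed step size, with no regularization term; a practical SVM solver would add one):

<syntaxhighlight lang="python">
import numpy as np

# Sketch: subgradient descent on the empirical hinge risk
# (1/n) * sum_i max(0, 1 - y_i * (w . x_i)) for a linear classifier.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)   # toy separable labels

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    active = margins < 1.0                  # hinge subgradient is nonzero here
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(X)
    w -= 0.1 * grad                         # fixed step size (illustrative)

print(w, np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))
</syntaxhighlight>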
 
== Generalized smooth hinge loss ==
The generalized smooth hinge loss function with parameter <math>\alpha</math> is defined as
 
:<math>f^*_\alpha(z) \;=\; \begin{cases} \frac{\alpha}{\alpha + 1} - z & \text{if }z \leq 0 \\ \frac{1}{\alpha + 1}z^{\alpha + 1} - z + \frac{\alpha}{\alpha + 1} & \text{if } 0<z<1 \\ 0 & \text{if } z \geq 1 \end{cases},</math>
 
where
:<math>z = yf(\vec{x}).</math>
 
It is monotonically decreasing and reaches 0 when <math>z = 1</math>.
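A sketch of this piecewise definition (the <math>\alpha</math> values tested are arbitrary); evaluating at the breakpoints shows the three pieces agree at <math>z=0</math> and <math>z=1</math>:

<syntaxhighlight lang="python">
import numpy as np

# Sketch: the generalized smooth hinge loss; its three pieces meet at
# z = 0 and z = 1, where the loss reaches 0.
def smooth_hinge(z, alpha):
    z = np.asarray(z, dtype=float)
    zc = np.clip(z, 0.0, 1.0)          # keeps the power term valid off-branch
    mid = zc ** (alpha + 1) / (alpha + 1) - z + alpha / (alpha + 1)
    return np.where(z <= 0.0, alpha / (alpha + 1) - z,
                    np.where(z < 1.0, mid, 0.0))

for alpha in (1.0, 2.0, 4.0):
    print(alpha, smooth_hinge([-1.0, 0.0, 0.5, 1.0, 2.0], alpha))
</syntaxhighlight>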
 
== See also ==
*[[Differentiable programming]]
*[[Scoring function]]
 
== References ==
{{Reflist}}
 
{{Artificial intelligence navbox}}
 
[[Category:Machine learning algorithms]]