[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
{{Attention|reason=Discuss the difference compared to scoring rules|date=January 2024}}
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 | s2cid = 11845688 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{ -1,1 \}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathcal{Y}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | access-date= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last1= Rosasco | first1= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
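In practice <math>p(\vec{x},y)</math> is unknown, so the risk is approximated by averaging the loss over a finite sample. The following sketch illustrates this empirical average; the particular loss, the scoring function, and the toy data are placeholder assumptions for illustration only, not part of any standard library.

<syntaxhighlight lang="python">
import math

# Placeholder surrogate loss V(f(x), y): here the logistic loss
# log(1 + exp(-y * f(x))) / log(2); any margin-based loss could be used.
def V(score, y):
    return math.log(1.0 + math.exp(-y * score)) / math.log(2.0)

# Illustrative linear scoring function f: R^2 -> R with fixed weights.
def f(x, w=(0.8, -0.3)):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy sample {(x_i, y_i)} with labels in {-1, +1}.
sample = [((1.0, 0.5), 1), ((-0.7, 1.2), -1), ((0.2, -0.4), 1)]

# Empirical risk: the sample average of V(f(x_i), y_i),
# a finite-sample stand-in for the expected loss I[f].
empirical_risk = sum(V(f(x), y) for x, y in sample) / len(sample)
print(empirical_risk)
</syntaxhighlight>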
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
:<math>\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0, \;\;\;\;\;(1)</math>
where <math>\eta=p(y=1\mid\vec{x})</math>, which is also equivalent to setting the derivative of the conditional risk equal to zero.
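As an illustration of how equation (1) determines the optimal predictor, consider the square loss <math>\phi(v)=(1-v)^2</math> (shown in blue in the figure above). Substituting <math>\frac{\partial \phi(f)}{\partial f}=-2(1-f)</math> and <math>\frac{\partial \phi(-f)}{\partial f}=2(1+f)</math> into (1) gives
:<math>-2(1-f)\eta + 2(1+f)(1-\eta)=0 \;\Longrightarrow\; 2f+2-4\eta=0 \;\Longrightarrow\; f^*=2\eta-1.</math>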
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x}))</math>
where <math>H</math> is the [[Heaviside step function]].
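A minimal code sketch of this 0–1 loss (the function names are illustrative, not from any library):

<syntaxhighlight lang="python">
def heaviside(t):
    # Heaviside step function: 1 for t >= 0 and 0 otherwise
    # (the convention at t = 0 only matters exactly on the decision boundary).
    return 1 if t >= 0 else 0

def zero_one_loss(score, y):
    # V(f(x), y) = H(-y * f(x)): 0 when the sign of the score agrees with
    # the label y in {-1, +1}, and 1 when it does not.
    return heaviside(-y * score)

print(zero_one_loss(2.3, 1))   # correctly classified: 0
print(zero_one_loss(2.3, -1))  # misclassified: 1
</syntaxhighlight>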
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>)]]
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last1=Vasconcelos|first1=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function with a larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin of the logistic loss can be increased by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where smaller <math>0<\gamma<1</math> increases the margin of the loss. This is directly equivalent to decreasing the learning rate in [[gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes clear that when a learning rate <math>\gamma</math> is used, the correct formula for retrieving the posterior probability becomes <math>\eta=f^{-1}(\gamma F(x))</math>.
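A small numeric sketch of this relationship; the code is illustrative only and assumes the standard logit link <math>f^{-1}(v)=1/(1+e^{-v})</math> of the logistic loss, with <math>F(x)</math> standing for the boosted score:

<syntaxhighlight lang="python">
import math

def logistic_loss(v, gamma=1.0):
    # Margin-parametrized logistic loss (1/gamma) * log(1 + exp(-gamma * v));
    # a smaller gamma in (0, 1) increases the loss margin.
    return math.log(1.0 + math.exp(-gamma * v)) / gamma

def posterior_from_score(F, gamma=1.0):
    # Recover eta = p(y = 1 | x) from a score F(x) obtained with learning
    # rate gamma: eta = f^{-1}(gamma * F(x)), assuming the logit link.
    return 1.0 / (1.0 + math.exp(-gamma * F))

# The same raw score corresponds to a less extreme probability estimate
# when a smaller learning rate (larger loss margin) was used.
print(posterior_from_score(2.0, gamma=1.0))  # ~0.88
print(posterior_from_score(2.0, gamma=0.2))  # ~0.60
</syntaxhighlight>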
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}\right) = e^{-v}</math>
The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponential loss is used in the [[AdaBoost]] algorithm.
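A quick numerical comparison (illustrative only) of how much faster the exponential loss grows than the logistic loss on a badly misclassified point:

<syntaxhighlight lang="python">
import math

v = -5.0  # strongly negative margin y * f(x), i.e. a badly misclassified point

exponential_loss = math.exp(-v)               # e^{-v}, about 148.4
logistic_loss = math.log(1.0 + math.exp(-v))  # log(1 + e^{-v}), natural log, about 5.0

# The exponential loss is roughly 30 times larger at this margin, which is
# why it weights outliers much more heavily than the logistic loss does.
print(exponential_loss, logistic_loss)
</syntaxhighlight>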
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
:<math>f^*_\text{Exp}= \frac{1}{2}\ln\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\ln\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>

== Tangent loss ==
The Tangent loss function can be generated using <math>C(\eta)=4\eta(1-\eta)</math> and <math>f^{-1}(v)=\arctan(v)+\frac{1}{2}</math> as
:<math>
\begin{align}
\phi(v) & = C[f^{-1}(v)]+\left( 1-f^{-1}(v)\right) C'[f^{-1}(v)]
\\ & = 4 \left( \arctan(v)+\frac{1}{2} \right) \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) + \left( 1- \left( \arctan(v)+\frac{1}{2} \right) \right) \left( 4-8 \left( \arctan(v)+\frac{1}{2} \right) \right) \\
& = \left( 2\arctan(v)-1 \right) ^2.
\end{align}
</math>
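A short numerical check (illustrative only) that the simplification above is correct, comparing the unsimplified expression with <math>u=\arctan(v)+\tfrac{1}{2}</math> against the closed form:

<syntaxhighlight lang="python">
import math

def tangent_loss_long_form(v):
    # 4u(1 - u) + (1 - u)(4 - 8u) with u = arctan(v) + 1/2, as derived above.
    u = math.atan(v) + 0.5
    return 4 * u * (1 - u) + (1 - u) * (4 - 8 * u)

def tangent_loss(v):
    # The simplified closed form (2 * arctan(v) - 1)^2.
    return (2 * math.atan(v) - 1) ** 2

# The two expressions agree (up to floating-point rounding) at arbitrary points.
for v in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tangent_loss_long_form(v) - tangent_loss(v)) < 1e-12
</syntaxhighlight>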
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
:<math>f^*_\text{Tangent}= \tan \left( \eta-\frac{1}{2} \right) =\tan \left( p \left( 1\mid x \right) -\frac{1}{2}\right) .</math>
== Hinge loss ==
The hinge loss is defined by <math>\phi(v) = \max(0, 1-v)</math>; it is the loss function used by [[support vector machine]]s.

== References ==
{{Reflist}}
{{Artificial intelligence navbox}}
[[Category:Machine learning algorithms]]