Loss functions for classification
[[File:BayesConsistentLosses2.jpg|thumb|Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)]]
 
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf| title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510| citeseerx = 10.1.1.109.6786 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{-1,1\}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathbb{R}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | accessdate= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last= Rosasco | first= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
 
:<math>p(\vec{x},y)=p(y\mid\vec{x}) p(\vec{x}).</math>
 
Within classification, several commonly used [[loss functions]] are written solely in terms of the product of the true label <math>y</math> and the predicted label <math>f(\vec{x})</math>. Therefore, they can be defined as functions of only one variable <math>\upsilon = y f(\vec{x})</math>, so that

:<math>V(f(\vec{x}),y) = \phi(yf(\vec{x})) = \phi(\upsilon)</math>

with a suitably chosen function <math>\phi : \mathbb{R} \to \mathbb{R}</math>. These are called '''margin-based loss functions''', and choosing such a loss function amounts to choosing <math>\phi</math>. The selection of <math>\phi</math> impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk.
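The margin-based losses named in the figure above can be sketched numerically. The following is an illustrative Python sketch (the function names are ours, not a standard library API; the logistic loss is shown normalized so that <math>\phi(0)=1</math>):

```python
import math

# Margin-based losses phi(v), where v = y * f(x) is the margin.
# All function names below are illustrative, not from any library.

def zero_one(v):     # 0-1 loss expressed through the margin's sign
    return 1.0 if v <= 0 else 0.0

def hinge(v):        # max(0, 1 - v)
    return max(0.0, 1.0 - v)

def logistic(v):     # (1/ln 2) * ln(1 + e^{-v}), normalized so phi(0) = 1
    return math.log(1.0 + math.exp(-v)) / math.log(2.0)

def exponential(v):  # e^{-v}
    return math.exp(-v)

def square(v):       # (1 - v)^2
    return (1.0 - v) ** 2

# Each convex surrogate matches the 0-1 loss at the decision margin v = 0:
for phi in (hinge, logistic, exponential, square):
    assert phi(0.0) >= zero_one(0.0)  # all equal 1 at v = 0
```

Plotting these functions over <math>\upsilon \in [-2, 2]</math> reproduces the qualitative picture in the figure: all of the convex surrogates lie on or above the zero–one loss near the decision boundary.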
 
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) p(y\mid\vec{x})p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) p(-1\mid\vec{x})]p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) (1-p(1\mid\vec{x}))]p(\vec{x})\,d\vec{x}
\end{align}
</math>
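The second factorization shows that the risk can be minimized pointwise: for each fixed <math>\vec{x}</math> with <math>\eta = p(1\mid\vec{x})</math>, one minimizes the inner (conditional) risk <math>\phi(f)\eta + \phi(-f)(1-\eta)</math> over the scalar <math>f</math>. A minimal numerical sketch in Python (helper names are ours), using the square loss, whose conditional minimizer is known to be <math>f^* = 2\eta - 1</math>:

```python
# Conditional risk for a fixed input x with eta = p(1|x):
#   C(f) = phi(f) * eta + phi(-f) * (1 - eta)
# Illustrative sketch: minimize C by grid search for the square loss
# phi(v) = (1 - v)^2, whose exact minimizer is f* = 2*eta - 1.

def conditional_risk(f, eta, phi):
    return phi(f) * eta + phi(-f) * (1.0 - eta)

def square(v):
    return (1.0 - v) ** 2

eta = 0.8
grid = [i / 1000.0 for i in range(-1000, 1001)]  # candidate f in [-1, 1]
f_star = min(grid, key=lambda f: conditional_risk(f, eta, square))
assert abs(f_star - (2 * eta - 1)) < 1e-2        # grid minimizer near 0.6
```

The grid minimizer agrees with the closed-form value <math>2\eta - 1 = 0.6</math>, illustrating how the choice of <math>\phi</math> determines the optimal <math>f^{*}_\phi</math>.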
 
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is standard practice to substitute '''surrogate loss functions''' that are convex and smooth, and therefore tractable for commonly used learning algorithms. Beyond computational tractability, one can show that the solutions to the learning problem obtained with these surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci">{{Citation | last= Ramanan | first= Deva | title= Lecture 14 | publisher= UCI ICS273A: Machine Learning | date= 27 February 2008 | url= http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf | accessdate= 6 December 2014}}</ref> Some of these surrogates are described below.
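This recovery property can be illustrated on a toy example: minimizing the (smooth, convex) logistic loss by plain gradient descent on a linearly separable one-dimensional sample yields a sign-based classifier that also attains zero 0–1 risk on that sample. This is only a sketch under those assumptions, not a general training procedure:

```python
import math

# Toy sketch: gradient descent on the logistic surrogate loss for a
# linear model f(x) = w * x over a separable 1-D sample. Names and
# data are illustrative only.

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]  # (x, y) pairs

w = 0.0
lr = 0.1
for _ in range(500):
    grad = 0.0
    for x, y in data:
        v = y * w * x                          # margin
        grad += -y * x / (1.0 + math.exp(v))   # d/dw of log(1 + e^{-v})
    w -= lr * grad / len(data)

# The surrogate minimizer also minimizes the empirical 0-1 risk here:
zero_one_risk = sum(1 for x, y in data if y * w * x <= 0) / len(data)
assert zero_one_risk == 0.0
```

Here the surrogate is minimized by standard convex optimization, yet the resulting <math>\operatorname{sgn}(f)</math> classifies every training point correctly, sidestepping the NP-hard combinatorial problem.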
 
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[iid|independently and identically distributed]] sample points
== Hinge loss ==
{{main|Hinge loss}}
The hinge loss function is defined with <math>\phi(\upsilon) = \max(0, 1-\upsilon) = [1-\upsilon]_{+}</math>, where <math>[a]_{+} = \max(0,a)</math> is the [[positive part]] function.
Equivalently,

:<math>V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = [1 - yf(\vec{x})]_{+}.</math>
 
The hinge loss provides a relatively tight, convex upper bound on the 0–1 [[indicator function]]. Specifically, the hinge loss equals the 0–1 indicator function when <math>\operatorname{sgn}(f(\vec{x})) = y</math> and <math>|yf(\vec{x})| \geq 1</math>. In addition, empirical risk minimization with this loss is equivalent to the classical formulation of [[support vector machines]] (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized linearly, in proportion to their distance from the correct boundary.<ref name="Utah" />
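The per-point penalty behavior described above can be checked directly with a short illustrative sketch (the helper name is ours, not a library API):

```python
# Hinge loss V(f(x), y) = max(0, 1 - y * f(x)), illustrative sketch.

def hinge(f_x, y):
    return max(0.0, 1.0 - y * f_x)

# Correctly classified point outside the margin: no penalty.
assert hinge(2.0, 1) == 0.0
# Point inside the margin boundary y * f(x) = 1: linear penalty.
assert hinge(0.5, 1) == 0.5
# Misclassified point: penalty keeps growing linearly with distance.
assert hinge(-1.0, 1) == 2.0
```

The three cases correspond exactly to the SVM description above: zero loss outside the margin, then a penalty that grows linearly as a point moves inside the margin and past the separating hyperplane.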