Revision as of 13:09, 19 October 2022 by 167.220.197.19 (edit summary: the output space for f should be Y not R)
Revision as of 09:51, 15 December 2022 by 223.187.127.234 (no edit summary)
Line 3:
In [[machine learning]] and [[mathematical optimization]], '''loss functions for classification''' are computationally feasible [[loss functions]] representing the price paid for inaccuracy of predictions in [[statistical classification|classification problem]]s (problems of identifying which category a particular observation belongs to).<ref name="mit">{{Cite journal | last1 = Rosasco | first1 = L. | last2 = De Vito | first2 = E. D. | last3 = Caponnetto | first3 = A. | last4 = Piana | first4 = M. | last5 = Verri | first5 = A. | url = http://web.mit.edu/lrosasco/www/publications/loss.pdf | title = Are Loss Functions All the Same? | doi = 10.1162/089976604773135104 | journal = Neural Computation | volume = 16 | issue = 5 | pages = 1063–1076 | year = 2004 | pmid = 15070510 | citeseerx = 10.1.1.109.6786 | s2cid = 11845688 }}</ref> Given <math>\mathcal{X}</math> as the space of all possible inputs (usually <math>\mathcal{X} \subset \mathbb{R}^d</math>), and <math>\mathcal{Y} = \{ -1,1 \}</math> as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function <math>f: \mathcal{X} \to \mathcal{Y}</math> which best predicts a label <math>y</math> for a given input <math>\vec{x}</math>.<ref name="penn">{{Citation | last= Shen | first= Yi | title= Loss Functions For Binary Classification and Class Probability Estimation | publisher= University of Pennsylvania | year= 2005 | url= http://stat.wharton.upenn.edu/~buja/PAPERS/yi-shen-dissertation.pdf | access-date= 6 December 2014}}</ref> However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same <math>\vec{x}</math> to generate different <math>y</math>.<ref name="mitlec">{{Citation | last1= Rosasco | first1= Lorenzo | last2= Poggio | first2= Tomaso | title= A Regularization Tour of Machine Learning | series= MIT-9.520 Lectures Notes | volume= Manuscript | year= 2014}}</ref> As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
:<math>I[f] = \displaystyle \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy</math>
where <math>V(f(\vec{x}),y)</math> is a given loss function, and <math>p(\vec{x},y)</math> is the [[probability density function]] of the process that generated the data, which can equivalently be written as
Line 14:
:<math>
\begin{align}
I[f] & = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}),y) \, p(\vec{x},y) \,d\vec{x} \,dy \\[6pt]
& = \int_\mathcal{X} \int_\mathcal{Y} \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \,dy \,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})]\, p(\vec{x})\,d\vec{x} \\[6pt]
& = \int_\mathcal{X} [\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))]\, p(\vec{x})\,d\vec{x}
\end{align}
</math>
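The chain of equalities for <math>I[f]</math> can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the article or its sources. It assumes a finite input space, so the integral over <math>\mathcal{X}</math> becomes a sum. It also assumes the margin form <math>V(f(\vec{x}),y) = \phi(yf(\vec{x}))</math> used in the second line of the derivation, with the hinge loss as an arbitrary choice of <math>\phi</math>. The names <code>p_x</code>, <code>eta</code>, <code>risk_joint</code> and <code>risk_conditional</code> are hypothetical.

```python
def hinge(v):
    """Hinge margin loss phi(v) = max(0, 1 - v); an illustrative choice of phi."""
    return max(0.0, 1.0 - v)


def risk_joint(f, phi, p_x, eta):
    """First form: sum of phi(y * f(x)) * p(y | x) * p(x) over every pair (x, y)."""
    total = 0.0
    for x, px in p_x.items():
        for y in (-1, 1):
            p_y_given_x = eta[x] if y == 1 else 1.0 - eta[x]
            total += phi(y * f(x)) * p_y_given_x * px
    return total


def risk_conditional(f, phi, p_x, eta):
    """Last form: [phi(f(x)) p(1|x) + phi(-f(x)) (1 - p(1|x))] weighted by p(x)."""
    return sum(
        (phi(f(x)) * eta[x] + phi(-f(x)) * (1.0 - eta[x])) * px
        for x, px in p_x.items()
    )


# Hypothetical toy distribution over three inputs (assumed values, not real data).
p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # marginal p(x)
eta = {0: 0.9, 1: 0.4, 2: 0.1}   # eta[x] = p(1 | x)


def f(x):
    """Toy classifier: predict +1 where p(1 | x) >= 0.5, otherwise -1."""
    return 1.0 if eta[x] >= 0.5 else -1.0


joint = risk_joint(f, hinge, p_x, eta)
conditional = risk_conditional(f, hinge, p_x, eta)
print(joint, conditional)  # the two forms agree up to floating-point rounding
```

Both functions compute the same quantity; the second simply expands the inner sum over the two labels, which is the step from the second to the fourth line of the derivation.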

Loss functions for classification: Difference between revisions