For binary classification with labels <math>y \in \{-1,1\}</math>, many loss functions can be written as a function of the margin <math>\upsilon = yf(\vec{x})</math> alone:
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))=\phi(\upsilon)</math>
The choice of <math>\phi</math> impacts the optimal <math>f^{*}_\phi</math> which minimizes the expected risk. Loss functions in this form are known as ''margin losses''.
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
:<math>
\begin{align}
I[f] &= \int_{X \times Y} V(f(\vec{x}),y) \, p(\vec{x},y) \, d\vec{x} \, dy \\
&= \int_X \int_Y \phi(yf(\vec{x})) \, p(y\mid\vec{x}) \, p(\vec{x}) \, dy \, d\vec{x} \\
&= \int_X \left[\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, p(-1\mid\vec{x})\right] p(\vec{x}) \, d\vec{x} \\
&= \int_X \left[\phi(f(\vec{x})) \, p(1\mid\vec{x}) + \phi(-f(\vec{x})) \, (1-p(1\mid\vec{x}))\right] p(\vec{x}) \, d\vec{x}
\end{align}
</math>
The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for <math>y</math>, and the fourth because <math>p(-1\mid x)=1-p(1\mid x)</math>. The term within brackets <math>
[\phi(f(\vec{x})) p(1\mid\vec{x})+\phi(-f(\vec{x})) (1-p(1\mid\vec{x}))]
</math> is known as the ''conditional risk.''
One can solve for the minimizer of <math>I[f]</math> by taking the functional derivative of the last equality with respect to <math>f</math> and setting the derivative equal to 0. This will result in the following equation
:<math>
\frac{\partial \phi(f)}{\partial f}\eta + \frac{\partial \phi(-f)}{\partial f}(1-\eta)=0 \;\;\;\;\;(1)
</math>
which is also equivalent to setting the derivative of the conditional risk equal to zero.
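Equation (1) can also be checked numerically. The following sketch is an illustration only (the brute-force grid search and the choice of the exponential margin loss <math>\phi(\upsilon)=e^{-\upsilon}</math> are assumptions made here for the example): it minimizes the conditional risk <math>\phi(f)\eta+\phi(-f)(1-\eta)</math> over <math>f</math> and compares the result with the closed-form minimizer <math>\tfrac{1}{2}\log\tfrac{\eta}{1-\eta}</math> derived for the exponential loss later in the article.
<syntaxhighlight lang="python">
import numpy as np

def conditional_risk(phi, f, eta):
    """Conditional risk phi(f)*eta + phi(-f)*(1 - eta) for a margin loss phi."""
    return phi(f) * eta + phi(-f) * (1.0 - eta)

# Exponential margin loss, phi(v) = exp(-v), used here purely as an example.
phi_exp = lambda v: np.exp(-v)

eta = 0.8                                   # assumed posterior p(1 | x)
f_grid = np.linspace(-5.0, 5.0, 200001)     # dense grid of candidate scores f(x)
risks = conditional_risk(phi_exp, f_grid, eta)

f_numeric = f_grid[np.argmin(risks)]        # numerical minimizer of the conditional risk
f_closed = 0.5 * np.log(eta / (1.0 - eta))  # closed-form minimizer for the exponential loss

print(f_numeric, f_closed)                  # both are approximately 0.693
</syntaxhighlight>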
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for [[false positives and false negatives]]) would be the [[0-1 loss function]] (0–1 [[indicator function]]), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
:<math>V(f(\vec{x}),y)=H(-yf(\vec{x})),</math>
where <math>H</math> indicates the [[Heaviside step function]].
However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an [[NP-hard]] combinatorial optimization problem.<ref name="Utah">{{Citation | last= Piyush | first= Rai | title= Support Vector Machines (Contd.), Classification Loss Functions and Regularizers | publisher= Utah CS5350/6350: Machine Learning | date= 13 September 2011 | url= http://www.cs.utah.edu/~piyush/teaching/13-9-print.pdf | accessdate= 6 December 2014}}</ref> As a result, it is better to substitute continuous, convex '''loss function surrogates''' which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.<ref name="uci" />
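As a small illustration of the surrogate idea (the specific losses and margin values below are chosen only for demonstration), the snippet evaluates the 0–1 loss alongside several convex surrogates at a few margins <math>\upsilon = yf(\vec{x})</math>; each surrogate is convex in <math>\upsilon</math> and upper-bounds the 0–1 loss.
<syntaxhighlight lang="python">
import numpy as np

# Margin values v = y*f(x): negative means misclassified, positive means correct.
v = np.array([-2.0, -0.5, 0.5, 2.0])

zero_one    = (v <= 0).astype(float)                  # 0-1 loss (Heaviside of -v)
hinge       = np.maximum(0.0, 1.0 - v)                # hinge loss
square      = (1.0 - v) ** 2                          # square loss in margin form
logistic    = np.log(1.0 + np.exp(-v)) / np.log(2.0)  # logistic loss, normalized so phi(0) = 1
exponential = np.exp(-v)                              # exponential loss

for name, vals in [("0-1", zero_one), ("hinge", hinge), ("square", square),
                   ("logistic", logistic), ("exponential", exponential)]:
    print(f"{name:12s}", np.round(vals, 3))
</syntaxhighlight>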
In practice, the probability distribution <math>p(\vec{x},y)</math> is unknown. Consequently, utilizing a training set of <math>n</math> [[Independent and identically distributed random variables|independent and identically distributed]] sample points
:<math>S = \{(\vec{x}_1,y_1), \dots ,(\vec{x}_n,y_n)\}</math>
drawn from the data [[sample space]], one seeks to [[Empirical risk minimization|minimize the empirical risk]]
:<math>I_S[f] = \frac{1}{n} \sum_{i=1}^n V( f(\vec{x}_i),y_i)</math>
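For illustration, the empirical risk is simply the average surrogate loss over the sample. In the sketch below, the linear scorer, toy data, and the hinge surrogate are assumptions made for the example:
<syntaxhighlight lang="python">
import numpy as np

def empirical_risk(loss, f, X, y):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i) with a margin loss V(f, y) = loss(y * f)."""
    margins = y * f(X)
    return np.mean(loss(margins))

hinge = lambda v: np.maximum(0.0, 1.0 - v)   # example surrogate loss

# Toy training set: 2-D points with labels in {-1, +1}.
X = np.array([[0.5, 1.0], [0.3, 0.3], [-1.0, -0.5], [-0.2, 0.4]])
y = np.array([1, 1, -1, -1])

w = np.array([1.0, 1.0])                     # an (arbitrary) linear scorer f(x) = <w, x>
f = lambda X: X @ w

print(empirical_risk(hinge, f, X, y))        # average hinge loss on the toy sample
</syntaxhighlight>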
==Bayes consistency==
The minimizer of the expected risk associated with the 0–1 loss implements the Bayes optimal decision rule for binary classification and takes the form
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
A loss function is said to be ''classification-calibrated or Bayes consistent'' if its optimal <math>f^*_{\phi}</math> is such that <math>f^*_{0/1}(\vec{x}) = \operatorname{sgn}(f^*_{\phi}(\vec{x}))</math> and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function <math>f^*_{\phi}</math> by directly minimizing the expected risk and without having to explicitly model the probability density functions.
For a convex margin loss <math>\phi(\upsilon)</math>, it can be shown that <math>\phi(\upsilon)</math> is Bayes consistent if and only if it is differentiable at 0 and <math>\phi'(0)<0</math>.<ref>{{Cite journal|last=Bartlett|first=Peter L.|last2=Jordan|first2=Michael I.|last3=Mcauliffe|first3=Jon D.|date=2006|title=Convexity, Classification, and Risk Bounds|journal=Journal of the American Statistical Association|volume=101|issue=473|pages=138–156|issn=0162-1459|jstor=30047445|doi=10.1198/016214505000000907}}</ref><ref name="mit" />
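This condition can be checked numerically for common convex surrogates. In the sketch below (an illustrative check using a simple central finite-difference derivative estimate, which is an assumption of the example), each loss is differentiable at 0 with <math>\phi'(0)<0</math> and is therefore Bayes consistent by the quoted result.
<syntaxhighlight lang="python">
import numpy as np

def derivative_at_zero(phi, h=1e-6):
    """Central finite-difference estimate of phi'(0)."""
    return (phi(h) - phi(-h)) / (2.0 * h)

losses = {
    "hinge":       lambda v: np.maximum(0.0, 1.0 - v),
    "square":      lambda v: (1.0 - v) ** 2,
    "logistic":    lambda v: np.log(1.0 + np.exp(-v)) / np.log(2.0),
    "exponential": lambda v: np.exp(-v),
}

for name, phi in losses.items():
    d = derivative_at_zero(phi)
    print(f"{name:12s} phi'(0) ~ {d:+.3f}  (Bayes consistent: {d < 0})")
</syntaxhighlight>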
A more general result states that Bayes consistent loss functions can be generated using the following formulation
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent losses are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite journal|last=Leistner|first=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|date=September 2009}}</ref>
{| class="wikitable"
|+Table-I
!Loss name
!<math>\phi(v)</math>
!<math>C(\eta)</math>
!<math>f^{-1}(v)</math>
!<math>f(\eta)</math>
|-
|Exponential
|<math>e^{-v}</math>
|<math>2\sqrt{\eta(1-\eta)}</math>
|<math>\frac{e^{2v}}{1+e^{2v}}</math>
|<math>\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Logistic
|<math>\frac{1}{\log(2)}\log(1+e^{-v})</math>
|<math>\frac{1}{\log(2)}\left[-\eta\log(\eta)-(1-\eta)\log(1-\eta)\right]</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Square
|<math>(1-v)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\frac{1}{2}(v+1)</math>
|<math>2\eta-1</math>
|-
|Savage
|<math>\frac{1}{(1+e^{v})^2}</math>
|<math>\eta(1-\eta)</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Tangent
|<math>(2\arctan(v)-1)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\arctan(v)+\frac{1}{2}</math>
|<math>\tan\left(\eta-\frac{1}{2}\right)</math>
|}<br />The sole minimizer of the expected risk, <math>f^*_{\phi}</math>, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding <math>
f(\eta)
</math>. This holds even for the nonconvex loss functions, which means that gradient descent based algorithms such as [[gradient boosting]] can be used to construct the minimizer.
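Equation (2) can also be implemented directly. The sketch below is illustrative (the helper <code>make_phi</code> and the finite-difference estimate of <math>C'</math> are assumptions of the example); it builds <math>\phi(v)</math> from the exponential-loss choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math> in Table-I and checks that the construction reproduces <math>e^{-v}</math>.
<syntaxhighlight lang="python">
import numpy as np

def make_phi(C, f_inv, h=1e-6):
    """Build phi(v) = C[f_inv(v)] + (1 - f_inv(v)) * C'[f_inv(v)]  (equation (2))."""
    dC = lambda eta: (C(eta + h) - C(eta - h)) / (2.0 * h)  # finite-difference C'
    return lambda v: C(f_inv(v)) + (1.0 - f_inv(v)) * dC(f_inv(v))

# Exponential-loss choices from Table-I.
C     = lambda eta: 2.0 * np.sqrt(eta * (1.0 - eta))
f_inv = lambda v: np.exp(2.0 * v) / (1.0 + np.exp(2.0 * v))

phi = make_phi(C, f_inv)
v = np.linspace(-2.0, 2.0, 5)
print(np.round(phi(v), 4))          # generated loss
print(np.round(np.exp(-v), 4))      # target exp(-v); the two rows agree
</syntaxhighlight>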
==Proper loss functions, loss margin and regularization==
[[File:LogitLossMarginWithMu.jpg|alt=|thumb|(Red) standard Logistic loss (<math>\gamma=1, \mu=2</math>) and (Blue) increased margin Logistic loss (<math>\gamma=0.2</math>).]]
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last=Vasconcelos|first=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref>
In conclusion, by choosing a loss function with a larger loss margin (smaller <math>\gamma</math>), we increase regularization and improve our estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
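As a quick numerical check (a rough finite-difference sketch, not taken from the cited reference), the loss margin of the standard logistic loss evaluates to <math>\mu_{\phi}=2</math>, matching the value quoted in the figure caption.
<syntaxhighlight lang="python">
import numpy as np

def loss_margin(phi, h=1e-4):
    """mu_phi = -phi'(0) / phi''(0), estimated with central finite differences."""
    d1 = (phi(h) - phi(-h)) / (2.0 * h)
    d2 = (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h ** 2
    return -d1 / d2

logistic = lambda v: np.log(1.0 + np.exp(-v)) / np.log(2.0)
print(loss_margin(logistic))   # approximately 2.0
</syntaxhighlight>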
==Square loss==
While more commonly used in regression, the square loss function can be re-written as a function of the margin <math>yf(\vec{x})</math> and used for classification:
:<math>V(f(\vec{x}),y)=\phi(yf(\vec{x}))=(1-yf(\vec{x}))^2.</math>
The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions.<ref name="mit" /> In addition, functions which yield high values of <math>f(\vec{x})</math> for some <math>\vec{x} \in X</math> will perform poorly with the square loss function, since high values of <math>yf(\vec{x})</math> will be penalized severely, regardless of whether the signs of <math>y</math> and <math>f(\vec{x})</math> match.
A benefit of the square loss function is that its structure lends itself to easy cross-validation of regularization parameters. Specifically for [[Tikhonov regularization]], one can solve for the regularization parameter using leave-one-out [[Cross-validation (statistics)|cross-validation]] in the same time it would take to solve a single problem.
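A minimal sketch of this shortcut, assuming a linear model with ridge (Tikhonov) regularization and the square loss on labels in <math>\{-1,1\}</math>: the leave-one-out residual for point <math>i</math> equals the ordinary residual divided by <math>1-H_{ii}</math>, where <math>H</math> is the hat matrix, so all held-out errors follow from a single fit per candidate regularization parameter. The toy data and candidate values of <math>\lambda</math> are assumptions of the example.
<syntaxhighlight lang="python">
import numpy as np

def loo_error_ridge(X, y, lam):
    """Leave-one-out mean squared error for ridge regression, from a single fit.

    Uses the identity: loo_residual_i = residual_i / (1 - H_ii),
    where H = X (X^T X + lam I)^{-1} X^T is the hat matrix.
    """
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)
    H = X @ np.linalg.solve(A, X.T)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50))  # labels in {-1, +1}

for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, round(loo_error_ridge(X, y, lam), 4))   # LOO error for each candidate lambda
</syntaxhighlight>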
The minimizer of <math>I[f]</math> for the square loss function can be directly found from equation (1) as
:<math>f^*_\text{Square}= 2\eta-1=2p(1\mid x)-1.</math>

==Logistic loss==
The logistic loss function can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = \frac{1}{\log(2)}\log(1+e^{-v}).</math>
The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the [[LogitBoost|LogitBoost algorithm]].
The minimizer of <math>I[f]</math> for the logistic loss function can be directly found from equation (1) as
:<math>f^*_\text{Logistic}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>

==Exponential loss==
The exponential loss function can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 2\sqrt{(\frac{e^{2v}}{1+e^{2v}})(1-\frac{e^{2v}}{1+e^{2v}})}+(1-\frac{e^{2v}}{1+e^{2v}})(\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}(1-\frac{e^{2v}}{1+e^{2v}})}}) = e^{-v}</math>
The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponential loss is used in the [[AdaBoost|AdaBoost algorithm]].
The minimizer of <math>I[f]</math> for the exponential loss function can be directly found from equation (1) as
:<math>f^*_\text{Exp}= \frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>
==Savage loss==
The Savage loss can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = (\frac{e^v}{1+e^v})(1-\frac{e^v}{1+e^v})+(1-\frac{e^v}{1+e^v})(1-\frac{2e^v}{1+e^v}) = \frac{1}{(1+e^v)^2}.</math>
The Savage loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Savage loss has been used in [[gradient boosting]] and the SavageBoost algorithm.
The minimizer of <math>I[f]</math> for the Savage loss function can be directly found from equation (1) as
:<math>f^*_\text{Savage}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).</math>
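The contrast with the exponential loss is easy to see numerically (a small illustration): for increasingly negative margins the exponential penalty grows without bound while the Savage penalty saturates at 1.
<syntaxhighlight lang="python">
import numpy as np

v = np.array([-1.0, -2.0, -5.0, -10.0])      # increasingly badly misclassified points
exponential = np.exp(-v)                     # grows without bound for negative margins
savage = 1.0 / (1.0 + np.exp(v)) ** 2        # saturates at 1 for very negative margins

print(np.round(exponential, 2))   # roughly [2.72, 7.39, 148.41, 22026.47]
print(np.round(savage, 4))        # roughly [0.5344, 0.7758, 0.9867, 0.9999]
</syntaxhighlight>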
==Tangent loss==
The Tangent loss<ref>{{Cite journal|last=Masnadi-Shirazi|first=H.|last2=Mahadevan|first2=V.|last3=Vasconcelos|first3=N.|date=June 2010}}</ref> can be generated using (2) and Table-I as follows:
:<math>
\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 4\left(\arctan(v)+\frac{1}{2}\right)\left(1-\left(\arctan(v)+\frac{1}{2}\right)\right)+\left(1-\left(\arctan(v)+\frac{1}{2}\right)\right)\left(4-8\left(\arctan(v)+\frac{1}{2}\right)\right) = (2\arctan(v)-1)^2.
</math>
The Tangent loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly", which can help prevent over-training on the data set. The Tangent loss has been used in [[gradient boosting]], the TangentBoost algorithm and Alternating Decision Forests.<ref>{{Cite journal|last=Schulter|first=S.|last2=Wohlhart|first2=P.|last3=Leistner|first3=C.|last4=Saffari|first4=A.|last5=Roth|first5=P. M.|last6=Bischof|first6=H.|date=June 2013}}</ref>
The minimizer of <math>I[f]</math> for the Tangent loss function can be directly found from equation (1) as
:<math>f^*_\text{Tangent}= \tan(\eta-\frac{1}{2})=\tan(p(1\mid x)-\frac{1}{2}).</math>
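The two-sided boundedness of the Tangent loss can be seen by evaluating <math>(2\arctan(v)-1)^2</math> at large positive and negative margins (a small numerical illustration):
<syntaxhighlight lang="python">
import numpy as np

# Tangent loss in margin form, phi(v) = (2*arctan(v) - 1)^2.
tangent = lambda v: (2.0 * np.arctan(v) - 1.0) ** 2

v = np.array([-100.0, -10.0, -1.0, 0.0, 0.546, 1.0, 10.0, 100.0])
print(np.round(tangent(v), 3))
# The penalty saturates near (pi + 1)^2 for very negative margins (robustness to outliers)
# and near (pi - 1)^2 for very positive margins ("too correct" points are also penalized).
</syntaxhighlight>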
==Hinge loss==
The [[hinge loss]] function is defined as
:<math>V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = [1-yf(\vec{x})]_{+},</math>
where <math>[a]_{+} = \max(0,a)</math> is the positive part function.
The hinge loss provides a relatively tight, convex upper bound on the 0–1 [[indicator function]]. Specifically, the hinge loss equals the 0–1 [[indicator function]] when <math>\operatorname{sgn}(f(\vec{x})) = y</math> and <math>|yf(\vec{x})| \geq 1</math>. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for [[support vector machines]] (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.<ref name="Utah" />
While the hinge loss function is both convex and continuous, it is not smooth (it is not differentiable) at <math>yf(\vec{x})=1</math>. Consequently, the hinge loss function cannot be used with [[gradient descent]] methods or [[stochastic gradient descent]] methods which rely on differentiability over the entire ___domain. However, the hinge loss does have a subgradient at <math>yf(\vec{x})=1</math>, which allows for the utilization of [[subgradient method|subgradient descent]] methods.
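A minimal sketch of subgradient descent on the regularized empirical hinge risk of a linear classifier (the primal SVM objective); the step size, regularization strength, and toy data are assumptions of the example.
<syntaxhighlight lang="python">
import numpy as np

def hinge_subgradient_descent(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_i max(0, 1 - y_i <w, x_i>) + lam/2 * ||w||^2 by subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0                  # points with nonzero hinge loss
        # A subgradient of the hinge term is -y_i * x_i on active points, 0 elsewhere.
        subgrad = -(X[active] * y[active, None]).sum(axis=0) / n + lam * w
        w -= lr * subgrad
    return w

# Toy linearly separable data: two Gaussian clusters with labels in {-1, +1}.
rng = np.random.default_rng(1)
X_pos = rng.normal(size=(100, 2)) + np.array([1.0, 1.0])
X_neg = rng.normal(size=(100, 2)) - np.array([1.0, 1.0])
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])

w = hinge_subgradient_descent(X, y)
print(w, np.mean(np.sign(X @ w) == y))   # learned weights and training accuracy
</syntaxhighlight>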
The minimizer of <math>I[f]</math> for the hinge loss function is
:<math>f^*_\text{Hinge}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>