==Bayes Consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, which minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and takes the form
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
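The rule amounts to thresholding the posterior at <math>1/2</math>. A minimal illustrative sketch (the posterior values here are assumed inputs, not part of any cited method):
<syntaxhighlight lang="python">
import numpy as np

def bayes_rule_01(p1):
    """Zero-one Bayes optimal decision, given p1 = p(y = 1 | x).

    Predicts 1 when p(1|x) > p(-1|x) = 1 - p1 (i.e. p1 > 0.5),
    0 on ties, and -1 otherwise, matching the case statement above.
    """
    return np.where(p1 > 0.5, 1, np.where(p1 == 0.5, 0, -1))

print(bayes_rule_01(np.array([0.9, 0.5, 0.2])))  # [ 1  0 -1]
</syntaxhighlight>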
Bayes consistent loss functions can be generated using the following formula:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent losses are not convex; such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite journal|last=Leistner|first=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|date=September 2009|title=On robustness of on-line boosting - a competitive study|url=https://ieeexplore.ieee.org/document/5457451|journal=2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops|pages=1362–1369|doi=10.1109/ICCVW.2009.5457451|isbn=978-1-4244-4442-7}}</ref> For all loss functions generated from (2), the posterior probability <math>p(y=1|\vec{x})</math> can be found using the invertible ''link function'' as <math>p(y=1|\vec{x})=\eta=f^{-1}(v)</math>. Such loss functions, where the posterior probability can be recovered using the invertible link, are called ''proper loss functions''.
{| class="wikitable"
|+Table-I
!Loss name
!<math>\phi(v)</math>
!<math>C(\eta)</math>
!<math>f^{-1}(v)</math>
!<math>f(\eta)</math>
|-
|Exponential
|<math>e^{-v}</math>
|<math>2\sqrt{\eta(1-\eta)}</math>
|<math>\frac{e^{2v}}{1+e^{2v}}</math>
|<math>\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Logistic
|<math>\frac{1}{\log(2)}\log(1+e^{-v})</math>
|<math>\frac{-\eta\log(\eta)-(1-\eta)\log(1-\eta)}{\log(2)}</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Square
|<math>(1-v)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\frac{1}{2}(v+1)</math>
|<math>2\eta-1</math>
|-
|Savage
|<math>\frac{1}{(1+e^{v})^2}</math>
|<math>\eta(1-\eta)</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Tangent
|<math>(2\arctan(v)-1)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\arctan(v)+\frac{1}{2}</math>
|<math>\tan\left(\eta-\frac{1}{2}\right)</math>
|}
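To make the construction concrete, equation (2) can be evaluated numerically for any row of Table-I. The following is an illustrative sketch (not code from the cited references) using the Savage row; the result agrees with <math>\frac{1}{(1+e^v)^2}</math>, and the posterior is recovered through the link <math>f^{-1}</math>:
<syntaxhighlight lang="python">
import numpy as np

def make_loss(C, dC, f_inv):
    """Build phi(v) = C[f^-1(v)] + (1 - f^-1(v)) C'[f^-1(v)], i.e. equation (2)."""
    def phi(v):
        eta = f_inv(v)
        return C(eta) + (1.0 - eta) * dC(eta)
    return phi

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Savage row of Table-I: C(eta) = eta(1 - eta), f^-1(v) = e^v / (1 + e^v)
savage = make_loss(C=lambda e: e * (1 - e), dC=lambda e: 1 - 2 * e, f_inv=sigmoid)

v = np.linspace(-3, 3, 7)
print(np.allclose(savage(v), 1.0 / (1.0 + np.exp(v)) ** 2))  # True

# Posterior recovery via the link: p(y = 1 | x) = f^-1(v)
print(sigmoid(1.0))  # ~0.731, the posterior implied by a score v = 1
</syntaxhighlight>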
The minimizer of the expected risk for any loss function generated from (2) is given by
:<math>f^*_{\phi}(\vec{x}) = f(\eta)</math>.
This holds even for the nonconvex loss functions, which means that gradient-descent-based algorithms such as [[Gradient boosting|gradient boosting]] can be used to construct the minimizer.
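This can be checked numerically; the sketch below (illustrative only, assuming <code>scipy</code> is available) minimizes the conditional risk <math>\eta\,\phi(v)+(1-\eta)\,\phi(-v)</math> of the nonconvex Savage loss for a fixed <math>\eta</math> and recovers <math>f(\eta)=\log\frac{\eta}{1-\eta}</math>:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

savage = lambda v: 1.0 / (1.0 + np.exp(v)) ** 2  # nonconvex Savage loss

eta = 0.8  # a fixed conditional probability p(y = 1 | x)
risk = lambda v: eta * savage(v) + (1 - eta) * savage(-v)

v_star = minimize_scalar(risk, bounds=(-10, 10), method="bounded").x
print(v_star, np.log(eta / (1 - eta)))  # both approximately 1.3863
</syntaxhighlight>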
==Proper loss functions, loss margin and regularization==
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last=Vasconcelos|first=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function with a larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where a smaller <math>0<\gamma<1</math> increases the margin of the loss. This is shown to be directly equivalent to decreasing the learning rate in [[Gradient boosting|gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes clear that when a learning rate of <math>\gamma</math> is used, the correct formula for retrieving the posterior probability is now <math>\eta=f^{-1}(\gamma F(x))</math>.
In conclusion, by choosing a loss function with a larger margin (smaller <math>\gamma</math>), one increases regularization and improves the estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
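As a small symbolic check (an illustrative sketch assuming <code>sympy</code>, not code from the cited paper), the loss margin of the <math>\gamma</math>-parametrized logistic loss evaluates to <math>2/\gamma</math>, confirming that the margin grows as <math>\gamma</math> shrinks:
<syntaxhighlight lang="python">
import sympy as sp

v = sp.symbols("v", real=True)
gamma = sp.symbols("gamma", positive=True)
phi = sp.log(1 + sp.exp(-gamma * v)) / gamma  # gamma-parametrized logistic loss

# Loss margin: mu = -phi'(0) / phi''(0)
mu = -phi.diff(v).subs(v, 0) / phi.diff(v, 2).subs(v, 0)
print(sp.simplify(mu))  # 2/gamma
</syntaxhighlight>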
==Square loss==
While more commonly used in regression, the square loss function can be re-written as a function <math>\phi(yf(\vec{x}))</math> and utilized for classification. It can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 4(\frac{1}{2}(v+1))(1-\frac{1}{2}(v+1))+(1-\frac{1}{2}(v+1))(4-8(\frac{1}{2}(v+1)))=(1-v)^2.</math>
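The same algebra can be verified symbolically; a brief sketch (assuming <code>sympy</code>):
<syntaxhighlight lang="python">
import sympy as sp

v = sp.symbols("v", real=True)
eta = (v + 1) / 2            # f^-1(v) for the square loss (Table-I)
C = 4 * eta * (1 - eta)      # C(eta) = 4 eta (1 - eta)
dC = 4 - 8 * eta             # C'(eta)
phi = C + (1 - eta) * dC     # equation (2)
print(sp.factor(sp.expand(phi)))  # (v - 1)**2, i.e. (1 - v)**2
</syntaxhighlight>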