==Bayes Consistency==
Utilizing [[Bayes' theorem]], it can be shown that the optimal <math>f^*_{0/1}</math>, which minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and takes the form
:<math>f^*_{0/1}(\vec{x}) \;=\; \begin{cases} \;\;\;1& \text{if }p(1\mid\vec{x}) > p(-1\mid \vec{x}) \\ \;\;\;0 & \text{if }p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if }p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}</math>.
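The rule amounts to thresholding the posterior at <math>1/2</math>. A minimal illustrative sketch (the posterior values here are assumed inputs, not part of any cited method):
<syntaxhighlight lang="python">
import numpy as np

def bayes_rule_01(p1):
    """Zero-one Bayes optimal decision, given p1 = p(y = 1 | x).

    Predicts 1 when p(1|x) > p(-1|x) = 1 - p1 (i.e. p1 > 0.5),
    0 on ties, and -1 otherwise, matching the case statement above.
    """
    return np.where(p1 > 0.5, 1, np.where(p1 == 0.5, 0, -1))

print(bayes_rule_01(np.array([0.9, 0.5, 0.2])))  # [ 1  0 -1]
</syntaxhighlight>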
Bayes consistent loss functions can be generated using the following formula:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] \;\;\;\;\;(2)</math>,
where <math>f(\eta), (0\leq \eta \leq 1)</math> is any invertible function such that <math>f^{-1}(-v)=1-f^{-1}(v)</math> and <math>C(\eta)</math> is any differentiable strictly concave function such that <math>C(\eta)=C(1-\eta)</math>. Table-I shows the generated Bayes consistent loss functions for some example choices of <math>C(\eta)</math> and <math>f^{-1}(v)</math>. Note that the Savage and Tangent losses are not convex; such non-convex loss functions have been shown to be useful in dealing with outliers in classification.<ref name=":0" /><ref>{{Cite journal|last=Leistner|first=C.|last2=Saffari|first2=A.|last3=Roth|first3=P. M.|last4=Bischof|first4=H.|date=September 2009|title=On robustness of on-line boosting - a competitive study|url=https://ieeexplore.ieee.org/document/5457451|journal=2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops|pages=1362–1369|doi=10.1109/ICCVW.2009.5457451|isbn=978-1-4244-4442-7}}</ref> For all loss functions generated from (2), the posterior probability <math>p(y=1|\vec{x})</math> can be found using the invertible ''link function'' as <math>p(y=1|\vec{x})=\eta=f^{-1}(v)</math>. Such loss functions, where the posterior probability can be recovered using the invertible link, are called ''proper loss functions''.
{| class="wikitable"
|+Table-I
!Loss name
!<math>\phi(v)</math>
!<math>C(\eta)</math>
!<math>f^{-1}(v)</math>
!<math>f(\eta)</math>
|-
|Exponential
|<math>e^{-v}</math>
|<math>2\sqrt{\eta(1-\eta)}</math>
|<math>\frac{e^{2v}}{1+e^{2v}}</math>
|<math>\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Logistic
|<math>\frac{1}{\log(2)}\log(1+e^{-v})</math>
|<math>\frac{-\eta\log(\eta)-(1-\eta)\log(1-\eta)}{\log(2)}</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Square
|<math>(1-v)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\frac{1}{2}(v+1)</math>
|<math>2\eta-1</math>
|-
|Savage
|<math>\frac{1}{(1+e^{v})^2}</math>
|<math>\eta(1-\eta)</math>
|<math>\frac{e^{v}}{1+e^{v}}</math>
|<math>\log\left(\frac{\eta}{1-\eta}\right)</math>
|-
|Tangent
|<math>(2\arctan(v)-1)^2</math>
|<math>4\eta(1-\eta)</math>
|<math>\arctan(v)+\frac{1}{2}</math>
|<math>\tan\left(\eta-\frac{1}{2}\right)</math>
|}
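To make the construction concrete, equation (2) can be evaluated numerically for any row of Table-I. The following is an illustrative sketch (not code from the cited references) using the Savage row; the result agrees with <math>\frac{1}{(1+e^v)^2}</math>, and the posterior is recovered through the link <math>f^{-1}</math>:
<syntaxhighlight lang="python">
import numpy as np

def make_loss(C, dC, f_inv):
    """Build phi(v) = C[f^-1(v)] + (1 - f^-1(v)) C'[f^-1(v)], i.e. equation (2)."""
    def phi(v):
        eta = f_inv(v)
        return C(eta) + (1.0 - eta) * dC(eta)
    return phi

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Savage row of Table-I: C(eta) = eta(1 - eta), f^-1(v) = e^v / (1 + e^v)
savage = make_loss(C=lambda e: e * (1 - e), dC=lambda e: 1 - 2 * e, f_inv=sigmoid)

v = np.linspace(-3, 3, 7)
print(np.allclose(savage(v), 1.0 / (1.0 + np.exp(v)) ** 2))  # True

# Posterior recovery via the link: p(y = 1 | x) = f^-1(v)
print(sigmoid(1.0))  # ~0.731, the posterior implied by a score v = 1
</syntaxhighlight>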
The minimizer of the expected risk for any loss function generated from (2) is given by
:<math>f^*_{\phi}(\vec{x}) = f(\eta)</math>.
This holds even for the nonconvex loss functions, which means that gradient-descent-based algorithms such as [[Gradient boosting|gradient boosting]] can be used to construct the minimizer.
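This can be checked numerically; the sketch below (illustrative only, assuming <code>scipy</code> is available) minimizes the conditional risk <math>\eta\,\phi(v)+(1-\eta)\,\phi(-v)</math> of the nonconvex Savage loss for a fixed <math>\eta</math> and recovers <math>f(\eta)=\log\frac{\eta}{1-\eta}</math>:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

savage = lambda v: 1.0 / (1.0 + np.exp(v)) ** 2  # nonconvex Savage loss

eta = 0.8  # a fixed conditional probability p(y = 1 | x)
risk = lambda v: eta * savage(v) + (1 - eta) * savage(-v)

v_star = minimize_scalar(risk, bounds=(-10, 10), method="bounded").x
print(v_star, np.log(eta / (1 - eta)))  # both approximately 1.3863
</syntaxhighlight>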
==Proper loss functions, loss margin and regularization==
For proper loss functions, the ''loss margin'' can be defined as <math>\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}</math> and shown to be directly related to the regularization properties of the classifier.<ref>{{Cite journal|last=Vasconcelos|first=Nuno|last2=Masnadi-Shirazi|first2=Hamed|date=2015|title=A View of Margin Losses as Regularizers of Probability Estimates|url=http://jmlr.org/papers/v16/masnadi15a.html|journal=Journal of Machine Learning Research|volume=16|issue=85|pages=2751–2795|issn=1533-7928}}</ref> Specifically, a loss function with a larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a <math>\gamma</math> parameter and writing the logistic loss as <math>\frac{1}{\gamma}\log(1+e^{-\gamma v})</math>, where a smaller <math>0<\gamma<1</math> increases the margin of the loss. This is shown to be directly equivalent to decreasing the learning rate in [[Gradient boosting|gradient boosting]] <math>F_m(x) = F_{m-1}(x) + \gamma h_m(x),</math> where decreasing <math>\gamma</math> improves the regularization of the boosted classifier. The theory makes clear that when a learning rate of <math>\gamma</math> is used, the correct formula for retrieving the posterior probability is now <math>\eta=f^{-1}(\gamma F(x))</math>.
In conclusion, by choosing a loss function with a larger margin (smaller <math>\gamma</math>), one increases regularization and improves the estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
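As a small symbolic check (an illustrative sketch assuming <code>sympy</code>, not code from the cited paper), the loss margin of the <math>\gamma</math>-parametrized logistic loss evaluates to <math>2/\gamma</math>, confirming that the margin grows as <math>\gamma</math> shrinks:
<syntaxhighlight lang="python">
import sympy as sp

v = sp.symbols("v", real=True)
gamma = sp.symbols("gamma", positive=True)
phi = sp.log(1 + sp.exp(-gamma * v)) / gamma  # gamma-parametrized logistic loss

# Loss margin: mu = -phi'(0) / phi''(0)
mu = -phi.diff(v).subs(v, 0) / phi.diff(v, 2).subs(v, 0)
print(sp.simplify(mu))  # 2/gamma
</syntaxhighlight>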
==Square loss==
While more commonly used in regression, the square loss function can be re-written as a function <math>\phi(yf(\vec{x}))</math> and utilized for classification. It can be generated using (2) and Table-I as follows:
:<math>\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)] = 4(\frac{1}{2}(v+1))(1-\frac{1}{2}(v+1))+(1-\frac{1}{2}(v+1))(4-8(\frac{1}{2}(v+1)))=(1-v)^2.</math>
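The same algebra can be verified symbolically; a brief sketch (assuming <code>sympy</code>):
<syntaxhighlight lang="python">
import sympy as sp

v = sp.symbols("v", real=True)
eta = (v + 1) / 2            # f^-1(v) for the square loss (Table-I)
C = 4 * eta * (1 - eta)      # C(eta) = 4 eta (1 - eta)
dC = 4 - 8 * eta             # C'(eta)
phi = C + (1 - eta) * dC     # equation (2)
print(sp.factor(sp.expand(phi)))  # (v - 1)**2, i.e. (1 - v)**2
</syntaxhighlight>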