==Probability calibration==
Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers, [[decision tree learning|decision trees]] and [[Boosting (machine learning)|boosting]] methods, produce distorted class probability distributions.<ref name="Niculescu">{{cite conference |last1=Niculescu-Mizil |first1=Alexandru |first2=Rich |last2=Caruana |title=Predicting good probabilities with supervised learning |conference=ICML |year=2005 |url=http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |doi=10.1145/1102351.1102430 |deadurl=yes |archiveurl=https://web.archive.org/web/20140311005243/http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |archivedate=2014-03-11 |df= }}</ref> In the case of decision trees, where {{math|Pr(''y''{{!}}'''x''')}} is the proportion of training samples with label {{mvar|y}} in the leaf where {{math|'''x'''}} ends up, these distortions come about because learning algorithms such as [[C4.5]] or [[Predictive analytics#Classification and regression trees|CART]] explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high [[Bias of an estimator|bias]]) while using few samples to estimate the relevant proportion (high [[Bias–variance tradeoff|variance]]).<ref>{{cite conference |first1=Bianca |last1=Zadrozny |first2=Charles |last2=Elkan |title=Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers |url=http://cseweb.ucsd.edu/~elkan/calibrated.pdf |year=2001 |conference=ICML |pages=609–616}}</ref>
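
The leaf-proportion behaviour can be seen directly in code. The following is a minimal sketch using scikit-learn and a synthetic dataset (both are assumptions for illustration, not tools used by the sources cited above): a fully grown tree makes most leaves pure, so its predicted "probabilities" pile up at 0 and 1.

<syntaxhighlight lang="python">
# Minimal sketch (scikit-learn assumed): an unpruned decision tree's
# predict_proba returns the label proportion of the leaf a sample
# falls into, so most outputs are extreme (0.0 or 1.0).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
proba = tree.predict_proba(X_test)[:, 1]  # leaf label proportions

# Fraction of test points receiving an extreme "probability":
print(((proba == 0.0) | (proba == 1.0)).mean())
</syntaxhighlight>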
 
[[File:Calibration plot.png|thumb|An example calibration plot]] Calibration can be assessed using a '''calibration plot''' (also called a '''reliability diagram''').<ref name="Niculescu" /><ref>{{Cite web|url=https://jmetzen.github.io/2015-04-14/calibration.html|title=Probability calibration|website=jmetzen.github.io|access-date=2019-06-18}}</ref> A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly calibrated classifier whose predicted probabilities or scores cannot be used directly as probabilities. In this case, one can use a method to turn these scores into properly [[Calibration (statistics)|calibrated]] class membership probabilities.
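
A reliability diagram is built by binning the predicted probabilities and plotting, for each bin, the observed fraction of positives against the mean predicted probability. A minimal sketch, assuming scikit-learn and matplotlib and reusing <code>y_test</code> and <code>proba</code> from the sketch above:

<syntaxhighlight lang="python">
# Minimal sketch of a calibration plot (reliability diagram).
# y_test and proba are reused from the decision-tree sketch above.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Bin the predictions and compute the observed positive rate per bin.
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
</syntaxhighlight>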
 
For the [[binary classification|binary]] case, a common approach is to apply [[Platt scaling]], which learns a [[logistic regression]] model on the scores.<ref name="platt99">{{cite journal |last=Platt |first=John |authorlink=John Platt (computer scientist) |title=Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods |journal=Advances in Large Margin Classifiers |volume=10 |issue=3 |year=1999 |pages=61–74 |url=https://www.researchgate.net/publication/2594015}}</ref>
An alternative method using [[isotonic regression]]<ref>{{Cite book |last1=Zadrozny |first1=Bianca |last2=Elkan |first2=Charles |doi=10.1145/775047.775151 |chapter=Transforming classifier scores into accurate multiclass probability estimates |chapter-url=http://www.cs.cornell.edu/courses/cs678/2007sp/ZadroznyElkan.pdf |title=Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02 |pages=694–699 |year=2002 |isbn=978-1-58113-567-1 |citeseerx=10.1.1.164.8140}}</ref><ref>Rosen, David B.; Burke, Harry B.; Goodman, Philip H. (1996). "Improving Prediction Accuracy Using a Calibration Postprocessor". ''World Congress on Neural Networks (WCNN, San Diego, California, September 1996)'', Erlbaum. "... go through the cases sequentially, enforcing monotonicity. This is the strategy employed by the 'Pool Adjacent Violators' (Barlow et al., 1972) algorithms used to perform monotone regression".</ref> is generally superior to Platt's method when sufficient training data is available.<ref name="Niculescu"/>
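
Both approaches are available off the shelf; for instance, scikit-learn's <code>CalibratedClassifierCV</code> (an assumption for illustration, not a tool used by the cited papers) implements Platt scaling as <code>method="sigmoid"</code> and isotonic regression as <code>method="isotonic"</code>. A minimal sketch, reusing the train/test split from the first sketch above:

<syntaxhighlight lang="python">
# Minimal sketch (scikit-learn assumed): calibrating an SVM's scores
# with Platt scaling ("sigmoid") and with isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

svm = LinearSVC()  # emits uncalibrated decision scores, not probabilities

platt = CalibratedClassifierCV(svm, method="sigmoid", cv=3).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(svm, method="isotonic", cv=3).fit(X_train, y_train)

# Both wrappers now expose calibrated class membership probabilities.
print(platt.predict_proba(X_test)[:3])
print(isotonic.predict_proba(X_test)[:3])
</syntaxhighlight>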
 
In the [[multiclass classification|multiclass]] case, one can use a reduction to binary tasks, calibrate each binary classifier with one of the methods described above, and then combine the resulting pairwise estimates using the pairwise coupling algorithm of Hastie and Tibshirani, as illustrated in the sketch below.<ref>{{Cite journal |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |doi=10.1214/aos/1028144844 |title=Classification by pairwise coupling |journal=[[The Annals of Statistics]] |volume=26 |issue=2 |pages=451–471 |year=1998 |zbl=0932.62071 |citeseerx=10.1.1.46.6032}}</ref>
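
The coupling step itself can be sketched from the description in Hastie and Tibshirani's paper, under the simplifying assumption of equal pair weights ''n''<sub>''ij''</sub> (NumPy assumed): given a matrix <code>R</code> where <code>R[i, j]</code> estimates {{math|Pr(class ''i'' {{!}} class ''i'' or ''j'')}}, the algorithm iteratively rescales a class distribution ''p'' until the implied pairwise probabilities ''p''<sub>''i''</sub>/(''p''<sub>''i''</sub> + ''p''<sub>''j''</sub>) match <code>R</code>.

<syntaxhighlight lang="python">
# Minimal sketch of pairwise coupling (Hastie & Tibshirani, 1998),
# assuming equal pair weights. R[i, j] estimates P(i | i or j), with
# R[j, i] = 1 - R[i, j]; the diagonal is ignored.
import numpy as np

def pairwise_coupling(R, n_iter=200, tol=1e-8):
    k = R.shape[0]
    p = np.full(k, 1.0 / k)  # start from the uniform distribution
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(k):
            num = den = 0.0
            for j in range(k):
                if j == i:
                    continue
                num += R[i, j]
                den += p[i] / (p[i] + p[j])  # implied pairwise probability
            p[i] *= num / den
            p /= p.sum()  # renormalise after each update
        if np.abs(p - p_old).max() < tol:
            break
    return p

# Toy example with three classes and consistent pairwise estimates;
# class 0 should receive the largest coupled probability.
R = np.array([[0.5, 0.8, 0.7],
              [0.2, 0.5, 0.6],
              [0.3, 0.4, 0.5]])
print(pairwise_coupling(R))
</syntaxhighlight>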