Probabilistic classification: Difference between revisions

Revision as of 19:40, 10 September 2017 by Edratzer (→Probability calibration: Added calibration plot (or reliability diagram)). Latest revision as of 17:26, 19 August 2025 by Bender the Bot (→Probability calibration: HTTP to HTTPS for Cornell University; tag: AWB).
(31 intermediate revisions by 21 users not shown)
{{Short description|Machine learning problem}}
{{machine learning bar}}
In [[machine learning]], a '''probabilistic classifier''' is a [[statistical classification|classifier]] that, given an observation of an input, predicts a [[probability distribution]] over a [[Set (mathematics)|set]] of classes rather than only the most likely class for that observation. The predicted probabilities can be useful in their own right<ref>{{cite book |first1=Trevor |last1=Hastie |first2=Robert |last2=Tibshirani |first3=Jerome |last3=Friedman |year=2009 |title=The Elements of Statistical Learning |url=http://statweb.stanford.edu/~tibs/ElemStatLearn/ |page=348 |quote=[I]n [[data mining]] applications the interest is often more in the class probabilities <math>p_\ell(x), \ell = 1, \dots, K</math> themselves, rather than in performing a class assignment. |url-status=dead |archive-url=https://web.archive.org/web/20150126123924/http://statweb.stanford.edu/~tibs/ElemStatLearn/ |archive-date=2015-01-26 }}</ref> or when combining classifiers into [[ensemble classifier|ensembles]].

==Types of classification==
Formally, an "ordinary" classifier is some rule, or function, that assigns to a sample {{mvar|x}} a class label {{mvar|ŷ}}:

:<math>\hat{y} = f(x)</math>

The samples come from some set {{mvar|X}} (e.g., the set of all [[document classification|documents]], or the set of all [[Computer vision#Recognition|images]]), while the class labels form a finite set {{mvar|Y}} defined prior to training.

Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are [[conditional probability|conditional]] distributions <math>\Pr(Y \vert X)</math>, meaning that for a given <math>x \in X</math>, they assign probabilities to all <math>y \in Y</math> (and these probabilities sum to one). "Hard" classification can then be done using the [[Bayes estimator|optimal decision rule]]<ref name="bishop">{{cite book |first=Christopher M.
|last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer}}</ref>{{rp|39–40}}

:<math>\hat{y} = \operatorname{arg\,max}_{y} \Pr(Y=y \vert X)</math>

that is, the predicted class is the one with the highest probability.

Binary probabilistic classifiers are also called [[binary regression]] models in [[statistics]]. In [[econometrics]], probabilistic classification in general is called [[discrete choice]].

Some classification models, such as [[naive Bayes classifier|naive Bayes]], [[logistic regression]] and [[multilayer perceptron]]s (when trained under an appropriate [[loss function]]), are naturally probabilistic. Other models, such as [[support vector machine]]s, are not, but [[#Probability calibration|methods exist]] to turn them into probabilistic classifiers.

==Generative and conditional training==
Some models, such as [[logistic regression]], are conditionally trained: they optimize the conditional probability <math>\Pr(Y \vert X)</math> directly on a training set (see [[empirical risk minimization]]).
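
As a minimal sketch of conditional training and of the decision rule described above, the following Python code fits a one-feature logistic regression by gradient descent on the mean negative conditional log-likelihood and then picks the most probable class. The toy data, the learning rate, the number of epochs and the function names (<code>fit_logistic</code>, <code>predict_proba</code>, <code>predict</code>) are illustrative assumptions and do not come from any particular library.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit Pr(Y=1 | x) = sigmoid(w*x + b) by gradient descent on the
    mean negative conditional log-likelihood (hypothetical helper)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the score
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Return the full distribution over the classes {0, 1} for input x."""
    p1 = sigmoid(w * x + b)
    return {0: 1.0 - p1, 1: p1}


def predict(x, w, b):
    """Hard classification: choose the class with the highest probability."""
    dist = predict_proba(x, w, b)
    return max(dist, key=dist.get)


# Toy, non-separable training set (an assumption for illustration only).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(predict_proba(1.0, w, b))  # a distribution over both classes
print(predict(1.0, w, b))        # the single most probable class
```

The same fitted model serves both purposes: <code>predict_proba</code> exposes the conditional distribution, while <code>predict</code> collapses it to a single label.
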
Other classifiers, such as [[naive Bayes]], are trained [[Generative model|generatively]]: at training time, the class-conditional distribution <math>\Pr(X \vert Y)</math> and the class [[Prior probability|prior]] <math>\Pr(Y)</math> are estimated, and the conditional distribution <math>\Pr (Y \vert X)</math> is derived using [[Bayes' theorem|Bayes' rule]].<ref name="bishop"/>{{rp|43}}

==Probability calibration==
{{Main article|Calibration (statistics)}}
Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers, [[decision tree learning|decision trees]] and [[Boosting (machine learning)|boosting]] methods, produce distorted class probability distributions.<ref name="Niculescu">{{cite conference |last1=Niculescu-Mizil |first1=Alexandru |first2=Rich |last2=Caruana |title=Predicting good probabilities with supervised learning |conference=ICML |year=2005 |url=http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf | doi = 10.1145/1102351.1102430 |url-status=dead |archive-url=https://web.archive.org/web/20140311005243/http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |archive-date=2014-03-11 }}</ref> In the case of decision trees, where {{math|Pr(''y''{{!}}'''x''')}} is the proportion of training samples with label {{mvar|y}} in the leaf where {{math|'''x'''}} ends up, these distortions arise because learning algorithms such as [[C4.5]] or [[Predictive analytics#Classification and regression trees|CART]] explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high [[Bias of an estimator|bias]]) while using few samples to estimate the relevant proportion (high [[Bias–variance tradeoff|variance]]).<ref>{{cite conference |first1=Bianca |last1=Zadrozny |first2=Charles |last2=Elkan |title=Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers
|url=http://cseweb.ucsd.edu/~elkan/calibrated.pdf |year=2001 |conference=ICML |pages=609–616}}</ref>

[[File:Calibration plot.png|thumb|An example calibration plot]]
Calibration can be assessed using a '''calibration plot''' (also called a '''reliability diagram''').<ref name="Niculescu" /><ref>{{Cite web|url=https://jmetzen.github.io/2015-04-14/calibration.html|title=Probability calibration|website=jmetzen.github.io|access-date=2019-06-18}}</ref> A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly calibrated classifier whose predicted probabilities or scores cannot be used as probabilities. In this case one can use a method to turn these scores into properly [[Calibration (statistics)|calibrated]] class membership probabilities.

For the [[binary classification|binary]] case, a common approach is to apply [[Platt scaling]], which learns a [[logistic regression]] model on the scores.<ref name="platt99">{{cite journal |last=Platt |first=John |author-link=John Platt (computer scientist) |title=Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods |journal=Advances in Large Margin Classifiers |volume=10 |issue=3 |year=1999 |pages=61–74 |url=https://www.researchgate.net/publication/2594015}}</ref> An alternative method using [[isotonic regression]]<ref>{{Cite book | last1 = Zadrozny | first1 = Bianca| last2 = Elkan | first2 = Charles| doi = 10.1145/775047.775151 | chapter = Transforming classifier scores into accurate multiclass probability estimates | chapter-url =
https://www.cs.cornell.edu/courses/cs678/2007sp/ZadroznyElkan.pdf| title = Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02 | pages = 694–699| year = 2002 | isbn = 978-1-58113-567-1| id = [[CiteSeerX]]: {{URL|1=citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.7457|2=10.1.1.13.7457}}| citeseerx = 10.1.1.164.8140| s2cid = 3349576}}</ref> is generally superior to Platt's method when sufficient training data is available.<ref name="Niculescu"/>

In the [[multiclass classification|multiclass]] case, one can use a reduction to binary tasks, followed by univariate calibration with one of the algorithms described above and a further application of the pairwise coupling algorithm by Hastie and Tibshirani.<ref>{{Cite journal | last1 = Hastie | first1 = Trevor| last2 = Tibshirani | first2 = Robert| doi = 10.1214/aos/1028144844 | title = Classification by pairwise coupling | journal = [[The Annals of Statistics]] | volume = 26 | issue = 2 | pages = 451–471| year = 1998 | zbl = 0932.62071| id = [[CiteSeerX]]: {{URL|1=citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.6032|2=10.1.1.46.6032}}| citeseerx = 10.1.1.309.4720 }}</ref>

==Evaluating probabilistic classification==
Commonly used evaluation metrics that compare the predicted probability to observed outcomes include [[log loss]], [[Brier score]], and a variety of calibration errors. Log loss is also used as a loss function in the training of logistic models.

Calibration error metrics aim to quantify the extent to which a probabilistic classifier's outputs are ''well-calibrated''.
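
The metrics named above can be sketched in a few lines of Python for the binary case. The example labels and predictions, the clipping constant <code>eps</code>, the equal-width binning with <code>n_bins</code> bins and the function names are assumptions made for illustration; the binned calibration error shown here is only one simple variant among the calibration error metrics.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted Pr(Y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted Pr(Y=1) and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def binned_calibration_error(y_true, p_pred, n_bins=5):
    """Weighted average, over equal-width probability bins, of the gap between
    the observed positive rate and the mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        frac_pos = sum(y for y, _ in members) / len(members)
        mean_p = sum(p for _, p in members) / len(members)
        error += len(members) / n * abs(frac_pos - mean_p)
    return error


# Hypothetical outcomes and two sets of predictions for the same items.
y_true = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]
hedged = [0.5, 0.5, 0.5, 0.5, 0.5]

for name, preds in (("confident", confident), ("hedged", hedged)):
    print(name, log_loss(y_true, preds), brier_score(y_true, preds),
          binned_calibration_error(y_true, preds))
```

Lower values are better for all three quantities. In this toy example the confident predictions score better than the uninformative constant 0.5 on both log loss and Brier score.
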
As [[Philip Dawid]] put it, "a forecaster is well-calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent".<ref>{{cite journal |doi=10.1080/01621459.1982.10477856 |title=The Well-Calibrated Bayesian |journal=Journal of the American Statistical Association |volume=77 |issue=379 |pages=605–610 |year=1982 |last1=Dawid |first1=A. P}}</ref> A foundational metric for measuring calibration error is the Expected Calibration Error (ECE).<ref>{{cite book | first1 = M.P. | last1= Naeini | first2 = G. | last2 = Cooper| first3 = M. | last3 = Hauskrecht| chapter = Obtaining well calibrated probabilities using bayesian binning |title = Proceedings of the AAAI Conference on Artificial Intelligence | year = 2015 | chapter-url = https://www.dbmi.pitt.edu/wp-content/uploads/2022/10/Obtaining-well-calibrated-probabilities-using-Bayesian-binning.pdf}}</ref> More recent work proposes variants of ECE that address limitations that may arise when classifier scores concentrate on a narrow subset of the interval [0,1], including the Adaptive Calibration Error (ACE)<ref>{{cite book | first1 = J. | last1= Nixon | first2 = M.W. | last2 = Dusenberry| first3 = L. | last3 = Zhang| first4= G. | last4 = Jerfel | first5 = D. | last5 = Tran | chapter = Measuring Calibration in Deep Learning |title = CVPR workshops | year = 2019 | chapter-url = https://openaccess.thecvf.com/content_CVPRW_2019/papers/Uncertainty%20and%20Robustness%20in%20Deep%20Visual%20Learning/Nixon_Measuring_Calibration_in_Deep_Learning_CVPRW_2019_paper.pdf}}</ref> and Test-based Calibration Error (TCE).<ref>{{cite book | first1 = T. | last1= Matsubara | first2 = N. | last2 = Tax| first3 = R. | last3 = Mudd| first4= I.
| last4 = Guy | chapter = TCE: A Test-Based Approach to Measuring Calibration Error |title = Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI) | year = 2023 | arxiv= 2306.14343 }}</ref>

A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a [[scoring rule]].

==Software implementations==
* MoRPE<ref>{{cite web |title=MoRPE |url=https://github.com/adaviding/morpe |website=GitHub |access-date=17 February 2023}}</ref> is a trainable probabilistic classifier that uses [[isotonic regression]] for probability calibration. It solves the [[multiclass classification|multiclass]] case by reduction to binary tasks. It is a type of kernel machine that uses an inhomogeneous polynomial kernel.

==References==