{{Short description|Machine learning problem}}
{{machine learning bar}}
In [[machine learning]], a '''probabilistic classifier''' is a [[statistical classification|classifier]] that is able to predict, given an observation of an input, a [[probability distribution]] over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into [[Ensemble learning|ensembles]].
==Types of classification==
Formally, an "ordinary" classifier is some rule, or [[function (mathematics)|function]], that assigns to a sample {{mvar|x}} a class label {{mvar|ŷ}}:
:<math>\hat{y} = f(x)</math>
The samples come from some set {{mvar|X}} (e.g., the set of all [[document classification|documents]], or the set of all [[Computer vision#Recognition|images]]), while the class labels form a finite set {{mvar|Y}} defined prior to training.
Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are [[conditional probability|conditional]] distributions <math>\Pr(Y \vert X)</math>, meaning that for a given <math>x \in X</math>, they assign probabilities to all <math>y \in Y</math> (and these probabilities sum to one). "Hard" classification can then be done using the optimal decision rule
:<math>\hat{y} = \operatorname{\arg\max}_{y} \Pr(Y=y \vert X=x)</math>
or, in English, the predicted class is that which has the highest probability.
Binary probabilistic classifiers are also called [[binary regression]] models in [[statistics]]. In [[econometrics]], probabilistic classification in general is called [[discrete choice]].
Some classification models, such as [[naive Bayes classifier|naive Bayes]], [[logistic regression]] and [[multilayer perceptron]]s (when trained under an appropriate [[loss function]]) are naturally probabilistic. Other models such as [[support vector machine]]s are not, but [[#Probability calibration|methods exist]] to turn them into probabilistic classifiers.
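For illustration, a minimal sketch using the [[scikit-learn]] library (the synthetic dataset and estimator settings here are arbitrary choices, not drawn from any cited reference): a logistic regression exposes class probabilities directly, whereas a plain support vector machine only returns an uncalibrated decision score.

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn is installed); data and settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Naturally probabilistic: predict_proba returns Pr(y | x) for each class.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3]))        # each row sums to one

# Not naturally probabilistic: the SVM only yields a signed distance to its hyperplane.
svm = LinearSVC(max_iter=10000).fit(X, y)
print(svm.decision_function(X[:3]))    # scores, not probabilities
</syntaxhighlight>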
==Generative and conditional training==
Some models, such as [[logistic regression]], are conditionally trained: they optimize the conditional probability <math>\Pr(Y \vert X)</math> directly on a training set (see [[empirical risk minimization]]). Other classifiers, such as [[naive Bayes classifier|naive Bayes]], are trained [[generative model|generatively]]: at training time, the class-conditional distribution <math>\Pr(X \vert Y)</math> and the class [[prior probability|prior]] <math>\Pr(Y)</math> are found, and the conditional distribution <math>\Pr(Y \vert X)</math> is derived using [[Bayes' rule]].
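As an example (a minimal sketch with scikit-learn; the Gaussian naive Bayes variant and the synthetic data are illustrative assumptions), a conditionally trained logistic regression and a generatively trained naive Bayes model both end up exposing the conditional distribution <math>\Pr(Y \vert X)</math>:

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn): conditional vs. generative training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression   # conditional: fits Pr(Y | X) directly
from sklearn.naive_bayes import GaussianNB             # generative: fits Pr(X | Y) and Pr(Y)

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

conditional = LogisticRegression(max_iter=1000).fit(X, y)
generative = GaussianNB().fit(X, y)

# Both report a conditional class distribution for new inputs; GaussianNB derives
# it from the fitted class-conditional densities and class priors via Bayes' rule.
print(conditional.predict_proba(X[:2]))
print(generative.predict_proba(X[:2]))
</syntaxhighlight>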
==Probability calibration==
{{Main article|Calibration (statistics)}}
Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers, [[decision tree learning|decision trees]] and [[Boosting (machine learning)|boosting]] methods, produce distorted class probability distributions.<ref name="Niculescu">{{cite conference |last1=Niculescu-Mizil |first1=Alexandru |first2=Rich |last2=Caruana |title=Predicting good probabilities with supervised learning |conference=ICML |year=2005 |url=http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |doi=10.1145/1102351.1102430 |url-status=dead |archive-url=https://web.archive.org/web/20140311005243/http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |archive-date=2014-03-11 }}</ref> In the case of decision trees, where {{math|Pr(''y''{{!}}'''x''')}} is the proportion of training samples with label {{mvar|y}} in the leaf where {{math|'''x'''}} ends up, these distortions come about because learning algorithms such as [[C4.5]] or [[Predictive analytics#Classification and regression trees|CART]] explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high [[Bias of an estimator|bias]]) while using few samples to estimate the relevant proportion (high [[Bias–variance tradeoff|variance]]).<ref>{{cite conference |first1=Bianca |last1=Zadrozny |first2=Charles |last2=Elkan |title=Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers |url=http://cseweb.ucsd.edu/~elkan/calibrated.pdf |year=2001 |conference=ICML |pages=609–616}}</ref>
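The effect can be demonstrated with a minimal sketch (scikit-learn; the noisy synthetic data set is an arbitrary choice): an unpruned tree places nearly every training sample in a pure leaf, so its predicted probabilities pile up at 0 and 1 even though the labels are noisy.

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn): an unpruned tree's leaf proportions are
# pushed toward 0 or 1, illustrating the distortion described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so the true class probabilities are not 0 or 1.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit
p = tree.predict_proba(X_test)[:, 1]
print(np.mean((p == 0) | (p == 1)))   # fraction of hard 0/1 "probabilities"; close to 1
</syntaxhighlight>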
[[File:Calibration plot.png|thumb|An example calibration plot]] Calibration can be assessed using a '''calibration plot''' (also called a '''reliability diagram''').<ref name="Niculescu" /><ref>{{Cite web|url=https://jmetzen.github.io/2015-04-14/calibration.html|title=Probability calibration|website=jmetzen.github.io|access-date=2019-06-18}}</ref> A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly-calibrated classifier for which the predicted probabilities or scores can not be used as probabilities. In this case one can use a method to turn these scores into properly [[Calibration (statistics)|calibrated]] class membership probabilities.
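A reliability diagram can be produced, for example, with the <code>calibration_curve</code> helper in scikit-learn (a minimal sketch; the naive Bayes classifier and synthetic data are placeholders):

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn and matplotlib): plot observed class frequency
# against mean predicted probability per bin; the diagonal is perfect calibration.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

prob_pos = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)

plt.plot(mean_pred, frac_pos, "s-", label="naive Bayes")
plt.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")  # identity line
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
</syntaxhighlight>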
For the [[binary classification|binary]] case, a common approach is to apply [[Platt scaling]], which learns a [[logistic regression]] model on the scores.<ref name="platt99">{{cite journal |last=Platt |first=John |authorlink=John Platt (computer scientist) |title=Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods |journal=Advances in large margin classifiers |volume=10 |issue=3 |year=1999 |pages=61–74 |url=http://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods/file/504635154cff5262d6.pdf}}</ref>
An alternative method using [[isotonic regression]]<ref>{{Cite book | last1 = Zadrozny | first1 = Bianca| last2 = Elkan | first2 = Charles| doi = 10.1145/775047.775151 | chapter = Transforming classifier scores into accurate multiclass probability estimates | chapter-url = https://www.cs.cornell.edu/courses/cs678/2007sp/ZadroznyElkan.pdf| title = Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02 | pages = 694–699| year = 2002 | isbn = 978-1-58113-567-1| id = [[CiteSeerX]]: {{URL|1=citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.7457|2=10.1.1.13.7457}}| citeseerx = 10.1.1.164.8140| s2cid = 3349576}}</ref> is generally superior to Platt's method when sufficient training data is available.<ref name="Niculescu"/>
In the [[multiclass classification|multiclass]] case, one can use a reduction to binary tasks, followed by univariate calibration with an algorithm as described above and further application of the pairwise coupling algorithm by Hastie and Tibshirani.<ref>{{cite journal |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |year=1998 |title=Classification by pairwise coupling |journal=The Annals of Statistics |volume=26 |issue=2 |pages=451–471 |doi=10.1214/aos/1028144844}}</ref>
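In scikit-learn, for instance, both univariate approaches are exposed through <code>CalibratedClassifierCV</code> (a minimal sketch; the linear SVM and synthetic data are placeholders): <code>method="sigmoid"</code> corresponds to Platt scaling and <code>method="isotonic"</code> to isotonic regression.

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn): wrap an uncalibrated classifier with
# Platt scaling ("sigmoid") or isotonic regression to obtain class probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

platt = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(LinearSVC(), method="isotonic", cv=5).fit(X_train, y_train)

print(platt.predict_proba(X_test[:3]))     # calibrated Pr(y | x) from the SVM scores
print(isotonic.predict_proba(X_test[:3]))
</syntaxhighlight>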
==Evaluating probabilistic classification==
Commonly used evaluation metrics that compare the predicted probability to the observed outcome include [[Cross-entropy|log loss]], the [[Brier score]], and a variety of calibration errors.
Calibration error metrics aim to quantify the extent to which a probabilistic classifier's outputs are ''well-calibrated''. As [[Philip Dawid]] put it, "a forecaster is well-calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent".<ref>{{cite journal |doi=10.1080/01621459.1982.10477856 |title=The Well-Calibrated Bayesian |journal=Journal of the American Statistical Association |volume=77 |issue=379 |pages=605–610 |year=1982 |last1=Dawid |first1=A. P}}</ref> A foundational metric for measuring calibration error is the Expected Calibration Error (ECE).<ref>{{cite book | first1 = M.P. | last1= Naeini | first2 = G. | last2 = Cooper| first3 = M. | last3 = Hauskrecht| chapter = Obtaining well calibrated probabilities using bayesian binning |title = Proceedings of the AAAI Conference on Artificial Intelligence | year = 2015 | chapter-url = https://www.dbmi.pitt.edu/wp-content/uploads/2022/10/Obtaining-well-calibrated-probabilities-using-Bayesian-binning.pdf}}</ref> More recent work proposes variants of ECE that address limitations arising when classifier scores concentrate on a narrow subset of the interval [0, 1], including the Adaptive Calibration Error (ACE)<ref>{{cite book | first1 = J. | last1= Nixon | first2 = M.W. | last2 = Dusenberry| first3 = L. | last3 = Zhang| first4= G. | last4 = Jerfel | first5 = D. | last5 = Tran | chapter = Measuring Calibration in Deep Learning |title = CVPR workshops | year = 2019 | chapter-url = https://openaccess.thecvf.com/content_CVPRW_2019/papers/Uncertainty%20and%20Robustness%20in%20Deep%20Visual%20Learning/Nixon_Measuring_Calibration_in_Deep_Learning_CVPRW_2019_paper.pdf}}</ref> and the Test-based Calibration Error (TCE).<ref>{{cite book | first1 = T. | last1= Matsubara | first2 = N. | last2 = Tax| first3 = R. | last3 = Mudd| first4= I. | last4 = Guy | chapter = TCE: A Test-Based Approach to Measuring Calibration Error |title = Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI) | year = 2023 | arxiv= 2306.14343 }}</ref>
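The basic idea can be illustrated with a minimal sketch (the simple equal-width binning shown here, applied to binary positive-class probabilities, is only one of several possible estimators of calibration error):

<syntaxhighlight lang="python">
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-binned ECE for binary positive-class probabilities (illustrative only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])        # assign each prediction to a bin
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            observed = y_true[mask].mean()            # empirical frequency in the bin
            predicted = y_prob[mask].mean()           # mean predicted probability
            ece += mask.mean() * abs(observed - predicted)
    return ece

# Systematically overconfident predictions yield a nonzero ECE.
print(expected_calibration_error([1, 0, 1, 0, 1, 1], [0.9, 0.8, 0.95, 0.7, 0.9, 0.85]))
</syntaxhighlight>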
A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a [[scoring rule]].
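Both the log loss and the Brier score mentioned above are examples of proper scoring rules; a minimal sketch of computing them with scikit-learn (the probability and outcome vectors are made-up examples):

<syntaxhighlight lang="python">
# Minimal sketch (assumes scikit-learn): score predicted probabilities against
# observed binary outcomes with two common scoring rules.
from sklearn.metrics import brier_score_loss, log_loss

y_true = [0, 1, 1, 0, 1]               # observed outcomes
y_prob = [0.1, 0.8, 0.6, 0.3, 0.9]     # predicted probability of the positive class

print(log_loss(y_true, y_prob))         # logarithmic score (lower is better)
print(brier_score_loss(y_true, y_prob)) # mean squared error of the probabilities
</syntaxhighlight>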
==Software implementations==
* MoRPE<ref>{{cite web |title=MoRPE |url=https://github.com/adaviding/morpe |website=GitHub |access-date=17 February 2023}}</ref> is a trainable probabilistic classifier that uses [[isotonic regression]] for probability calibration. It solves the [[multiclass classification|multiclass]] case by reduction to binary tasks. It is a type of kernel machine that uses an inhomogeneous polynomial kernel.
==References==
{{reflist|30em}}
[[Category:Probabilistic models]]