Discretization of continuous features

In [[statistics]] and [[machine learning]], '''discretization''' refers to the process of converting or partitioning continuous [[Variable (statistics)#Applied statistics|attributes]], [[Features (pattern recognition)|features]] or [[Dependent and independent variables|variables]] to discretized or [[nominal data|nominal]] attributes/features/variables/[[Interval (mathematics)|intervals]]. This can be useful when creating probability mass functions – formally, in [[density estimation]]. It is a form of [[discretization]] in general and also of [[data binning|binning]], as in making a [[histogram]]. Whenever [[continuous function|continuous]] data is discretized, there is always some amount of [[discretization error]]. The goal is to reduce the amount to a level considered [[wikt:negligible|negligible]] for the [[conceptual model|modeling]] purposes at hand.
 
Typically data is discretized into ''K'' partitions of equal length/width (equal intervals) or into partitions that each hold ''K''% of the total data (equal frequencies).<ref name=clarke>{{Cite journal
| last1 = Clarke | first1 = E. J.
| last2 = Barton | first2 = B. A.
| doi = 10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O
| title = Entropy and MDL discretization of continuous variables for Bayesian belief networks
| journal = International Journal of Intelligent Systems
| volume = 15
| pages = 61–92
| year = 2000
| pmid =
| pmc =
| url=http://sci2s.ugr.es/keel/pdf/specific/articulo/IJIS00.pdf |accessdate=2008-07-10
}}
</ref>
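
The two unsupervised schemes can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names (<code>equal_width_edges</code>, <code>equal_frequency_edges</code>, <code>assign_bins</code>) and the sample data are assumptions made for this example and do not come from the cited paper or any particular package.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so that each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges); a value equal to an edge goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical sample with one outlier, chosen to contrast the two schemes.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = assign_bins(data, equal_width_edges(data, 3))
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))
```

With this sample, the single outlier stretches the equal-width bins so that all other values fall into the first bin, while the equal-frequency scheme places three values in each bin.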
 
Mechanisms for discretizing continuous data include [[Usama Fayyad|Fayyad]] & Irani's MDL method,<ref>Fayyad, Usama M.; Irani, Keki B. (1993) {{cite web|hdl=2014/35171 | url = https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf | title = Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning| date = 29 July 2023 }}, ''Proc. 13th Int. Joint Conf. on Artificial Intelligence'' (Q334 .I571 1993), pp. 1022–1027</ref> which uses [[mutual information]] to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others.<ref>Dougherty, J.; Kohavi, R.; Sahami, M. (1995). "[http://robotics.stanford.edu/users/sahami/papers-dir/disc.pdf Supervised and Unsupervised Discretization of Continuous Features]". In A. Prieditis & S. J. Russell, eds. ''Work''. Morgan Kaufmann, pp. 194–202</ref>

Many machine learning algorithms are known to produce better models when continuous attributes are discretized.<ref>{{cite journal|url=http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/discretization%20survey%20kotsiantis.pdf |first1=S. |last1=Kotsiantis |first2=D. |last2=Kanellopoulos |title=Discretization Techniques: A recent survey |journal=GESTS International Transactions on Computer Science and Engineering |volume=32 |issue=1 |year=2006 |pages=47–58 |citeseerx=10.1.1.109.3084}}</ref>
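
The recursive, entropy-based splitting used by Fayyad & Irani's MDL method can be illustrated with a small Python sketch. It assumes the commonly stated form of the stopping rule, in which a cut is accepted only if its information gain exceeds (log2(''N'' − 1) + Δ) / ''N'' with Δ = log2(3^''k'' − 2) − [''k''·Ent(''S'') − ''k''1·Ent(''S''1) − ''k''2·Ent(''S''2)]. All function names are hypothetical, and the code is a simplified illustration rather than a reference implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """For (value, label) pairs sorted by value, return (gain, index, cut) of the
    boundary with the highest information gain, or None if no boundary exists."""
    labels = [y for _, y in pairs]
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between identical values
        left, right = labels[:i], labels[i:]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def mdl_accepts(labels, left, right, gain):
    """Assumed MDL stopping rule: accept the cut only if the gain pays for its coding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Recursively split the sorted data, returning accepted cut points in ascending order."""
    cuts = []

    def recurse(segment):
        found = best_cut(segment)
        if found is None:
            return
        gain, i, cut = found
        seg_labels = [y for _, y in segment]
        if not mdl_accepts(seg_labels, seg_labels[:i], seg_labels[i:], gain):
            return
        recurse(segment[:i])
        cuts.append(cut)
        recurse(segment[i:])

    recurse(sorted(zip(values, labels)))
    return cuts


# Hypothetical data: the class changes once, between 6 and 7.
cuts = mdl_cut_points(list(range(1, 13)), ["a"] * 6 + ["b"] * 6)
```

In this example the only accepted cut lies midway between 6 and 7. Further splits inside the two pure halves are rejected because they yield no information gain.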
== Software ==
This is a partial list of software that implements the MDL algorithm.
* [https://gforge.inria.fr/projects/discretize4crf discretize4crf] tool designed to work with popular [[Conditional random field|CRF]] implementations ([[C++]])
* [https://cran.r-project.org/web/packages/discretization/discretization.pdf mdlp] in the R package discretization
* [https://cran.r-project.org/web/packages/RWeka/RWeka.pdf Discretize] in the R package RWeka
 
== See also ==
* [[Data binning]]
* [[Density estimation]]
* [[Discretization error]]
* [[Histogram]]
* [[Continuity correction]]
 
== References ==
<references/>
 
[[Category:Estimation of densities]]
[[Category:Statistical data coding]]

{{statistics-stub}}