Discretization of continuous features: Difference between revisions

Revision as of 17:09, 3 August 2010 by Bgeelhoed: Undid revision 376936250 by EverGreg (perhaps promotion, but paper is relevant)
Latest revision as of 19:07, 17 January 2024 by BD2412: m →top: clean up spacing around commas and other punctuation fixes (Tag: AWB)
(37 intermediate revisions by 27 users not shown)
In [[statistics]] and [[machine learning]], '''discretization''' refers to the process of converting or partitioning continuous [[Variable (statistics)#Applied statistics|attributes]], [[Features (pattern recognition)|features]] or [[Dependent and independent variables|variables]] to discretized or [[nominal data|nominal]] attributes/features/variables/[[Interval (mathematics)|intervals]]. This can be useful when creating probability mass functions, formally in [[density estimation]]. It is a form of [[discretization]] in general and also of [[data binning|binning]], as in making a [[histogram]]. Whenever [[continuous function|continuous]] data is discretized, there is always some amount of [[discretization error]]. The goal is to reduce the amount to a level considered [[wikt:negligible|negligible]] for the [[conceptual model|modeling]] purposes at hand.

Typically data is discretized into ''K'' partitions of equal width (equal intervals) or into ''K'' partitions that each contain an equal share of the data (equal frequencies).<ref name=clarke>{{Cite journal |last1=Clarke |first1=E. J. |last2=Barton |first2=B. A. |doi=10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O |title=Entropy and MDL discretization of continuous variables for Bayesian belief networks |journal=International Journal of Intelligent Systems |volume=15 |pages=61–92 |year=2000 |url=http://sci2s.ugr.es/keel/pdf/specific/articulo/IJIS00.pdf |accessdate=2008-07-10}}</ref>

Mechanisms for discretizing continuous data include [[Usama Fayyad|Fayyad]] & Irani's MDL method,<ref>Fayyad, Usama M.; Irani, Keki B. (1993) {{cite web |hdl=2014/35171 |url=https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf |title=Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning |date=29 July 2023}}, ''Proc. 13th Int. Joint Conf. on Artificial Intelligence'' (Q334 .I571 1993), pp. 1022–1027</ref> which uses [[mutual information]] to recursively define the best bins, CAIM, CACC, Ameva, and many others.<ref>Dougherty, J.; Kohavi, R.; Sahami, M. (1995). "[http://robotics.stanford.edu/users/sahami/papers-dir/disc.pdf Supervised and Unsupervised Discretization of Continuous Features]". In A. Prieditis & S. J. Russell, eds. ''Work''. Morgan Kaufmann, pp. 194–202</ref>

Many machine learning algorithms are known to produce better models when continuous attributes are discretized.<ref>{{cite journal |first1=S. |last1=Kotsiantis |first2=D. |last2=Kanellopoulos |title=Discretization Techniques: A recent survey |journal=GESTS International Transactions on Computer Science and Engineering |volume=32 |issue=1 |year=2006 |pages=47–58 |citeseerx=10.1.1.109.3084}}</ref>

== Software ==
This is a partial list of software that implements the MDL algorithm.
* [https://gforge.inria.fr/projects/discretize4crf discretize4crf], a tool designed to work with popular [[Conditional random field|CRF]] implementations ([[C++]])
* [https://cran.r-project.org/web/packages/discretization/discretization.pdf mdlp] in the R package discretization
* [https://cran.r-project.org/web/packages/RWeka/RWeka.pdf Discretize] in the R package RWeka

== See also ==
* [[Data binning]]
* [[Density estimation]]
* [[Discretization error]]
* [[Histogram]]
* [[Continuity correction]]

== References ==
<references/>

[[Category:Estimation of densities]]
[[Category:Statistical data coding]]

{{statistics-stub}}
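
The equal-interval and equal-frequency schemes mentioned in the article can be sketched in Python. This is a minimal illustration and is not taken from the cited sources: the function names, the sample data, and the convention that a value equal to a cut point falls into the upper bin are all assumptions made for the example.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Interior cut points splitting [min, max] into k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior cut points chosen so each bin holds roughly len(values) / k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges).

    Assumed convention: a value equal to a cut point goes to the upper bin.
    """
    return [bisect_right(edges, v) for v in values]


# Hypothetical sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]

width_bins = assign_bins(data, equal_width_edges(data, 3))
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))

print(width_bins)  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(freq_bins)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

On this sample the outlier 100.0 stretches the equal-width bins so that eight of the nine values share one bin and the middle bin is empty, while the equal-frequency split places three values in each bin. Supervised methods such as the MDL approach choose cut points using class labels instead and are not covered by this sketch.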