Decision tree learning
* [[C4.5 algorithm|C4.5]] (successor of ID3)
* [[Predictive analytics#Classification and regression trees (CART)|CART]] (Classification And Regression Tree)<ref name="bfos" />
* [[Chi-square automatic interaction detection]] (CHAID). Performs multi-level splits when computing classification trees.<ref>{{Cite journal | doi = 10.2307/2986296 | last1 = Kass | first1 = G. V. | year = 1980 | title = An exploratory technique for investigating large quantities of categorical data | jstor = 2986296| journal = Applied Statistics | volume = 29 | issue = 2| pages = 119–127 }}</ref><ref>{{Cite journal|last1=Biggs|first1=David|last2=De Ville|first2=Barry|last3=Suen|first3=Ed|date=1991|title=A method of choosing multiway partitions for classification and decision trees|url=https://doi.org/10.1080/02664769100000005|journal=Journal of Applied Statistics|volume=18|issue=1|pages=49–62|doi=10.1080/02664769100000005|bibcode=1991JApSt..18...49B |issn=0266-4763}}</ref><ref>Ritschard, G. (2013), "CHAID and Earlier Supervised Tree Methods", in J.J. McArdle and G. Ritschard (eds), ''Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences'', Quantitative Methodology Series, New York: Routledge, pages 48–74. [https://www.researchgate.net/publication/315476407_CHAID_and_Earlier_Supervised_Tree_Methods Preprint]</ref>
* [[Multivariate adaptive regression splines|MARS]]: extends decision trees to handle numerical data better.
* Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.<ref name="Hothorn2006">{{Cite journal | doi = 10.1198/106186006X133933 | last1 = Hothorn | first1 = T.| last2 = Hornik | first2 = K.| last3 = Zeileis | first3 = A.| year = 2006 | title = Unbiased Recursive Partitioning: A Conditional Inference Framework | jstor = 27594202| journal = Journal of Computational and Graphical Statistics | volume = 15 | issue = 3| pages = 651–674 | citeseerx = 10.1.1.527.2935 | s2cid = 6074128 }}</ref><ref name="Strobl2009">{{Cite journal | doi = 10.1037/a0016973 | pmid = 19968396 | pmc = 2927982 | last1 = Strobl | first1 = C.| last2 = Malley | first2 = J.| last3 = Tutz | first3 = G.| year = 2009 | title = An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests | journal = Psychological Methods | volume = 14 | issue = 4| pages = 323–348 }}</ref>
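
As a minimal sketch, assuming the open-source scikit-learn library is available (the data set and parameters here are illustrative), a CART-style classification tree can be grown with the Gini splitting criterion:

<syntaxhighlight lang="python">
# Illustrative sketch: fit a CART-style classification tree with the Gini
# criterion on the Iris data set using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Print the learned splits as text.
print(export_text(clf, feature_names=list(iris.feature_names)))
</syntaxhighlight>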
 
===Gini impurity===
'''Gini impurity''', '''Gini's diversity index''',<ref>{{cite web |title=Growing Decision Trees |url=https://www.mathworks.com/help/stats/growing-decision-trees.html |website=MathWorks |publisher=MathWorks}}</ref> or '''[[Diversity index#Gini–Simpson index|Gini–Simpson index]]''' in biodiversity research, is named after Italian mathematician [[Corrado Gini]] and is used by the CART (classification and regression tree) algorithm for classification trees. Gini impurity measures how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set. It reaches its minimum (zero) when all cases in the node fall into a single target category.
 
For a set of items with <math>J</math> classes and relative frequencies <math>p_i</math>, <math>i \in \{1, 2, ...,J\}</math>, the probability of choosing an item with label <math>i</math> is <math>p_i</math>, and the probability of miscategorizing that item is <math>\sum_{k \ne i} p_k = 1-p_i</math>. The Gini impurity is computed by summing pairwise products of these probabilities for each class label:
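
<math display="block">\operatorname{I}_G(p) = \sum_{i=1}^{J} \left( p_i \sum_{k \ne i} p_k \right) = \sum_{i=1}^{J} p_i (1 - p_i) = \sum_{i=1}^{J} (p_i - p_i^2) = \sum_{i=1}^{J} p_i - \sum_{i=1}^{J} p_i^2 = 1 - \sum_{i=1}^{J} p_i^2</math>

As a minimal sketch (the helper function and example labels are illustrative), the same quantity can be computed from a list of class labels using the simplified form <math>1 - \sum_{i} p_i^2</math>:

<syntaxhighlight lang="python">
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum_i p_i**2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["A", "A", "B", "B"]))  # 0.5: maximally mixed for two classes
print(gini_impurity(["A", "A", "A", "A"]))  # 0.0: a pure node
</syntaxhighlight>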