Decision tree learning: Difference between revisions

Line 1:
{{short description|Machine learning algorithm}}
{{about|decision trees in machine learning|the use of the term in decision analysis|Decision tree}}
{{Machine learning|Supervised learning}}
 
Line 75:
* '''Rotation forest''' – in which every decision tree is trained by first applying [[principal component analysis]] (PCA) on a random subset of the input features (a simplified sketch follows this list).<ref>{{cite journal |last1=Rodriguez |first1=J. J. |last2=Kuncheva |first2=L. I.|author2-link=Ludmila Kuncheva |last3=Alonso |first3=C. J. |year=2006 |title=Rotation forest: A new classifier ensemble method |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=28 |issue=10 |pages=1619–1630 |doi=10.1109/TPAMI.2006.211 |pmid=16986543 |citeseerx=10.1.1.156.8277 |s2cid=6847493 }}</ref>
 
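The following is a minimal sketch of the rotation forest idea described in the item above, using scikit-learn; it simplifies the full procedure of Rodriguez et al. (which partitions the whole feature set and assembles one rotation matrix per tree), and the synthetic data, ensemble size, and subset size are arbitrary choices made here for illustration:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # synthetic binary target

trees = []
for _ in range(25):                               # ensemble size (arbitrary)
    subset = rng.choice(X.shape[1], size=5, replace=False)   # random feature subset
    pca = PCA().fit(X[:, subset])                 # PCA rotation of that subset
    tree = DecisionTreeClassifier().fit(pca.transform(X[:, subset]), y)
    trees.append((subset, pca, tree))

def predict(x):
    """Majority vote over trees, each applied to its own rotated feature subset."""
    votes = [t.predict(p.transform(x[None, s]))[0] for s, p, t in trees]
    return int(round(sum(votes) / len(votes)))

print(predict(X[0]), y[0])                        # prediction for one training point
</syntaxhighlight>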
A special case of a decision tree is a [[decision list]],<ref>{{cite journal|last1=Rivest|first1=Ron|title=Learning Decision Lists|journal=Machine Learning|date=Nov 1987|volume=3|issue=2|pages=229–246|doi=10.1023/A:1022607331053|s2cid=30625841|url=http://people.csail.mit.edu/rivest/pubs/Riv87b.pdf|doi-access=free}}</ref> which is a one-sided decision tree in which every internal node has exactly one leaf node and exactly one internal node as children (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity.{{Citation needed|date=December 2021}} They also permit non-greedy learning methods<ref>{{cite journal|last1=Letham|first1=Ben|last2=Rudin|first2=Cynthia|author2-link= Cynthia Rudin |last3=McCormick|first3=Tyler|last4=Madigan|first4=David|title=Interpretable Classifiers Using Rules And Bayesian Analysis: Building A Better Stroke Prediction Model|journal=Annals of Applied Statistics|date=2015|volume=9|issue=3|pages=1350–1371|doi=10.1214/15-AOAS848|arxiv=1511.01644|s2cid=17699665}}</ref> and allow monotonic constraints to be imposed.<ref>{{cite journal|last1=Wang|first1=Fulton|last2=Rudin|first2=Cynthia|author2-link=Cynthia Rudin|title=Falling Rule Lists|journal=Journal of Machine Learning Research|date=2015|volume=38|url=http://www.jmlr.org/proceedings/papers/v38/wang15a.pdf|access-date=2016-01-22|archive-date=2016-01-28|archive-url=https://web.archive.org/web/20160128223950/http://www.jmlr.org/proceedings/papers/v38/wang15a.pdf|url-status=dead}}</ref>
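A decision list can be read as a chain of "if ... else if ..." rules ending in a default leaf. The following hypothetical example (the feature names and thresholds are invented purely for illustration) shows the one-sided structure: each test either returns a class immediately or falls through to the next test:

<syntaxhighlight lang="python">
def classify(applicant):
    """A hypothetical decision list for a loan decision.

    Each internal node has one leaf child (the returned class) and one
    internal child (the next test); the final else is the bottommost leaf.
    """
    if applicant["income"] < 20000:
        return "reject"
    elif applicant["debt_ratio"] > 0.6:
        return "reject"
    elif applicant["years_employed"] >= 2:
        return "approve"
    else:
        return "review"

print(classify({"income": 45000, "debt_ratio": 0.3, "years_employed": 5}))  # approve
</syntaxhighlight>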
 
Notable decision tree algorithms include:
Line 87:
ID3 and CART were invented independently at around the same time (between 1970 and 1980){{Citation needed|date=August 2014}}, yet follow a similar approach for learning a decision tree from training tuples.
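The shared scheme is greedy top-down induction: choose the attribute test that best separates the classes under some criterion, partition the training tuples accordingly, and recurse until a node is pure or no split helps. The sketch below illustrates this with binary splits on categorical features and entropy-based [[information gain]]; the function names and toy data are invented for illustration, and real implementations of ID3 and CART differ in their splitting criteria and stopping rules:

<syntaxhighlight lang="python">
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, value, gain) of the binary categorical split with the highest information gain."""
    base, best = entropy(labels), (None, None, 0.0)
    for f in range(len(rows[0])):
        for v in set(r[f] for r in rows):
            left = [l for r, l in zip(rows, labels) if r[f] == v]
            right = [l for r, l in zip(rows, labels) if r[f] != v]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if gain > best[2]:
                best = (f, v, gain)
    return best

def build_tree(rows, labels):
    """Greedy top-down induction: split on the best attribute test, then recurse."""
    if len(set(labels)) == 1:
        return labels[0]                                   # pure leaf
    f, v, gain = best_split(rows, labels)
    if f is None:
        return Counter(labels).most_common(1)[0][0]        # majority-class leaf
    eq = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
    neq = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
    return {"feature": f, "value": v,
            "eq": build_tree([r for r, _ in eq], [l for _, l in eq]),
            "neq": build_tree([r for r, _ in neq], [l for _, l in neq])}

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels))   # a single split on feature 0 separates the two classes
</syntaxhighlight>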
 
It has also been proposed to leverage concepts of [[fuzzy set theory]] to define a special version of decision tree, known as the Fuzzy Decision Tree (FDT).<ref name="Janikow1998">{{Cite journal | doi=10.1109/3477.658573| title=Fuzzy decision trees: issues and methods| journal=IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)| volume=28| pages=1–14| year=1998| last1=Janikow| first1=C. Z.| issue=1| pmid=18255917}}</ref>
In this type of fuzzy classification, an input vector <math>\textbf{x}</math> is generally associated with multiple classes, each with a different confidence value.
Boosted ensembles of FDTs have recently been investigated as well, and they have shown performance comparable to that of other very efficient fuzzy classifiers.<ref name="Barsacchi2020">{{Cite journal | url=http://www.sciencedirect.com/science/article/pii/S0957417420302608 | doi=10.1016/j.eswa.2020.113436}}</ref>
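To illustrate how a fuzzy decision tree can assign multiple class confidences to a single input, the sketch below uses a hand-specified tree whose internal nodes apply a soft (sigmoid) membership instead of a hard threshold test; the tree structure, thresholds, steepness values, and class distributions are invented for demonstration and are not taken from the cited works:

<syntaxhighlight lang="python">
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Internal nodes carry a soft test; leaves carry class-confidence distributions.
fuzzy_tree = {
    "feature": 0, "threshold": 5.0, "steepness": 2.0,
    "left": {"leaf": {"A": 0.9, "B": 0.1}},
    "right": {
        "feature": 0, "threshold": 8.0, "steepness": 2.0,
        "left": {"leaf": {"A": 0.3, "B": 0.7}},
        "right": {"leaf": {"A": 0.05, "B": 0.95}},
    },
}

def fuzzy_classify(node, x, membership=1.0, scores=None):
    """Send the input down every branch, weighting each path by its fuzzy membership."""
    if scores is None:
        scores = {}
    if "leaf" in node:
        for cls, conf in node["leaf"].items():
            scores[cls] = scores.get(cls, 0.0) + membership * conf
        return scores
    mu_right = sigmoid(node["steepness"] * (x[node["feature"]] - node["threshold"]))
    fuzzy_classify(node["left"], x, membership * (1.0 - mu_right), scores)
    fuzzy_classify(node["right"], x, membership * mu_right, scores)
    return scores

print(fuzzy_classify(fuzzy_tree, [6.0]))   # both classes receive a nonzero confidence
</syntaxhighlight>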
Line 315:
* Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as [[overfitting]].<ref>{{Cite book | title = Principles of Data Mining | doi = 10.1007/978-1-84628-766-4 | year = 2007 | isbn = 978-1-84628-765-7 }}</ref>) Mechanisms such as [[Pruning (decision trees)|pruning]] are necessary to avoid this problem (with the exception of some algorithms, such as the Conditional Inference approach, which do not require pruning).<ref name="Hothorn2006" /><ref name="Strobl2009" />
* The average depth of the tree, defined by the number of nodes or tests until classification, is not guaranteed to be minimal or small under various splitting criteria.<ref name="Tris">{{cite journal |last1=Ben-Gal |first1=I. |last2=Trister |first2=C. |year=2015 |title=Parallel Construction of Decision Trees with Consistently Non Increasing Expected Number of Tests |journal=Applied Stochastic Models in Business and Industry |volume=31 |issue=1 |pages=64–78 |url=http://www.eng.tau.ac.il/~bengal/Trist.pdf}}</ref>
* For data including categorical variables with different numbers of levels, [[information gain in decision trees]] is biased in favor of attributes with more levels.<ref>{{cite conference|author=Deng, H.|author2=Runger, G. |author3=Tuv, E. |title=Bias of importance measures for multi-valued attributes and solutions|conference=Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN)|year=2011|pages= 293–300|url=https://www.researchgate.net/publication/221079908}}</ref> To counter this problem, instead of choosing the attribute with the highest [[information gain]], one can choose the attribute with the highest [[information gain ratio]] among the attributes whose information gain is greater than the mean information gain (see the sketch after this list).<ref>{{cite journal |doi=10.1007/BF00116251 |last=Quinlan |first=J. Ross |title=Induction of Decision Trees |journal=[[Machine Learning (journal)|Machine Learning]] |volume=1 |issue=1 |year=1986 |pages=81–106 |doi-access=free }}</ref> This biases the decision tree against considering attributes with a large number of distinct values, while not giving an unfair advantage to attributes with very low information gain. Alternatively, the issue of biased predictor selection can be avoided by the Conditional Inference approach,<ref name="Hothorn2006" /> a two-stage approach,<ref>{{Cite journal|last1=Brandmaier|first1=Andreas M.|last2=Oertzen|first2=Timo von|last3=McArdle|first3=John J.|last4=Lindenberger|first4=Ulman|title=Structural equation model trees.|journal=Psychological Methods|language=en|volume=18|issue=1|pages=71–86|doi=10.1037/a0030001|pmid=22984789|pmc=4386908|year=2012|hdl=11858/00-001M-0000-0024-EA33-9}}</ref> or adaptive leave-one-out feature selection.<ref>{{cite journal|last1=Painsky|first1=Amichai|last2=Rosset|first2=Saharon|title=Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|date=2017|volume=39|issue=11|pages=2142–2153|pmid=28114007|doi=10.1109/TPAMI.2016.2636831|arxiv=1512.03444|s2cid=5381516}}</ref>
 
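As referenced in the list above, the following sketch selects the attribute with the highest [[information gain ratio]] among those whose information gain is at least the mean gain; the helper names and toy data are invented for illustration, and only categorical attributes (one branch per value) are handled:

<syntaxhighlight lang="python">
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def info_gain(column, labels):
    """Information gain of splitting on a categorical attribute (one branch per value)."""
    total = len(labels)
    remainder = sum(
        (count / total) * entropy([l for v, l in zip(column, labels) if v == value])
        for value, count in Counter(column).items()
    )
    return entropy(labels) - remainder

def gain_ratio(column, labels):
    split_info = entropy(column)             # intrinsic information of the attribute itself
    return info_gain(column, labels) / split_info if split_info > 0 else 0.0

def select_attribute(columns, labels):
    """Highest gain ratio among attributes whose gain is at least the mean gain."""
    gains = [info_gain(col, labels) for col in columns]
    mean_gain = sum(gains) / len(gains)
    candidates = [i for i, g in enumerate(gains) if g >= mean_gain]
    return max(candidates, key=lambda i: gain_ratio(columns[i], labels))

# Attribute 0 is an ID-like column with many levels; attribute 1 is genuinely informative.
cols = [["a", "b", "c", "d", "e", "f"], ["x", "x", "x", "y", "y", "y"]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(select_attribute(cols, labels))        # 1: the many-valued ID-like attribute is not chosen
</syntaxhighlight>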
===Implementations===
Line 339:
 
===Alternative search methods===
Evolutionary algorithms have been used to avoid locally optimal decisions and to search the decision tree space with little ''a priori'' bias.<ref>{{cite book |last1=Papagelis |first1=A. |last2=Kalles |first2=D. |year=2001 |chapter=Breeding Decision Trees Using Evolutionary Techniques |title=Proceedings of the Eighteenth International Conference on Machine Learning, June 28–July 1, 2001 |pages=393–400 |chapter-url=http://www.gatree.com/wordpress/wp-content/uploads/2010/04/BreedinDecisioTreeUsinEvo.pdf }}</ref><ref>{{cite journal |last1=Barros |first1=Rodrigo C. |last2=Basgalupp |first2=M. P. |last3=Carvalho |first3=A. C. P. L. F. |last4=Freitas |first4=Alex A. |year=2012 |doi=10.1109/TSMCC.2011.2157494 |title=A Survey of Evolutionary Algorithms for Decision-Tree Induction |journal=IEEE Transactions on Systems, Man, and Cybernetics |series=Part C: Applications and Reviews |volume=42 |issue=3 |pages=291–312 |citeseerx=10.1.1.308.9068 |s2cid=365692 }}</ref>
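As a simple illustration of searching the tree space globally rather than greedily, the sketch below evolves small axis-parallel trees by mutation on an XOR-like dataset, where a purely greedy split criterion struggles; the tree representation, mutation scheme, population size, and data are all invented for demonstration and are far simpler than the methods in the cited surveys:

<syntaxhighlight lang="python">
import copy
import random
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)   # XOR-like target

def random_tree(depth=3):
    """A random tree: internal nodes are dicts, leaves are class labels."""
    if depth == 0 or random.random() < 0.3:
        return random.randint(0, 1)
    return {"f": random.randrange(2), "t": random.random(),
            "l": random_tree(depth - 1), "r": random_tree(depth - 1)}

def predict(node, x):
    while isinstance(node, dict):
        node = node["l"] if x[node["f"]] <= node["t"] else node["r"]
    return node

def accuracy(tree):
    return float(np.mean([predict(tree, x) == t for x, t in zip(X, y)]))

def mutate(tree):
    """Replace a randomly chosen subtree with a fresh random subtree."""
    tree = copy.deepcopy(tree)
    if not isinstance(tree, dict) or random.random() < 0.3:
        return random_tree(depth=2)
    branch = random.choice(["l", "r"])
    tree[branch] = mutate(tree[branch])
    return tree

# A (1 + 20) evolutionary loop: keep the best tree found so far, spawn mutated offspring.
random.seed(0)
best = random_tree()
for _ in range(200):
    offspring = [mutate(best) for _ in range(20)]
    best = max(offspring + [best], key=accuracy)

print(accuracy(best))          # training accuracy of the best tree found
</syntaxhighlight>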
 
It is also possible for a tree to be sampled using [[Markov chain Monte Carlo|MCMC]].<ref>{{cite journal |last1=Chipman |first1=Hugh A. |first2=Edward I. |last2=George |first3=Robert E. |last3=McCulloch |title=Bayesian CART model search |journal=Journal of the American Statistical Association |volume=93 |issue=443 |year=1998 |pages=935–948 |doi=10.1080/01621459.1998.10473750 |citeseerx=10.1.1.211.5573 }}</ref>
 
The tree can also be searched for in a bottom-up fashion.<ref>{{cite book |last1=Barros |first1=R. C. |last2=Cerri |first2=R. |last3=Jaskowiak |first3=P. A. |last4=Carvalho |first4=A. C. P. L. F. |doi=10.1109/ISDA.2011.6121697 |chapter=A bottom-up oblique decision tree induction algorithm |title=Proceedings of the 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011) |pages=450–456 |year=2011 |isbn=978-1-4577-1676-8 |s2cid=15574923 }}</ref> Alternatively, several trees can be constructed in parallel to reduce the expected number of tests until classification.<ref name="Tris" />
 
==See also==