===Gini impurity===
'''Gini impurity''', '''Gini's diversity index''',<ref>{{cite web |title=Growing Decision Trees |url=https://www.mathworks.com/help/stats/growing-decision-trees.html |website=MathWorks |publisher=MathWorks}}</ref> or [[Diversity index#Gini–Simpson index|Gini–Simpson index]] in biodiversity research, is used by the CART (classification and regression tree) algorithm for classification trees. Gini impurity (named after the Italian mathematician [[Corrado Gini]]) measures how often a randomly chosen element of the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set. It reaches its minimum (zero) when all cases in the node fall into a single target category.
 
For a set of items with <math>J</math> classes and relative frequencies <math>p_i</math>, <math>i \in \{1, 2, ...,J\}</math>, the probability of choosing an item with label <math>i</math> is <math>p_i</math>, and the probability of miscategorizing that item is <math>\sum_{k \ne i} p_k = 1-p_i</math>. The Gini impurity is computed by summing pairwise products of these probabilities for each class label:
 
:<math>\operatorname{I}_G(p) = \sum_{i=1}^J \left( p_i \sum_{k\neq i} p_k \right)
= \sum_{i=1}^J p_i (1 - p_i)
= \sum_{i=1}^J (p_i - p_i^2)
= \sum_{i=1}^J p_i - \sum_{i=1}^J p_i^2
= 1 - \sum_{i=1}^J p_i^2. </math>
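A minimal Python sketch of this computation (illustrative only; the function name <code>gini_impurity</code> and the use of <code>collections.Counter</code> are choices made for this example, not part of any particular CART implementation):

<syntaxhighlight lang="python">
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) for the class labels in the given sequence."""
    n = len(labels)
    if n == 0:
        return 0.0  # treat an empty node as pure
    counts = Counter(labels)  # frequency of each class label
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node reaches the minimum impurity of zero,
# while an evenly split two-class node has impurity 0.5.
print(gini_impurity(["a", "a", "a"]))       # 0.0
print(gini_impurity(["a", "b", "a", "b"]))  # 0.5
</syntaxhighlight>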
 
The Gini impurity is also an information-theoretic measure and corresponds to [[Tsallis entropy]] with deformation coefficient <math>q=2</math>, which in physics is associated with the lack of information in out-of-equilibrium, non-extensive, dissipative and quantum systems. In the limit <math>q\to 1</math>, one recovers the usual Boltzmann–Gibbs or Shannon entropy. In this sense, the Gini impurity is a variation of the usual entropy measure for decision trees.
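Writing the Tsallis entropy in its common normalization (with the Boltzmann constant set to 1) as <math>S_q(p) = \tfrac{1}{q-1}\bigl(1 - \sum_{i=1}^J p_i^q\bigr)</math>, the correspondence can be checked directly:

:<math>S_2(p) = \frac{1}{2-1}\left(1 - \sum_{i=1}^J p_i^2\right) = 1 - \sum_{i=1}^J p_i^2 = \operatorname{I}_G(p).</math>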
 
===Information gain===