Chi-square automatic interaction detection

{{Short description|Decision tree technique}}
'''Chi-square automatic interaction detection''' ('''CHAID''')<ref name=":1" /> is a [[Decision tree learning|decision tree]] technique based on adjusted significance testing ([[Bonferroni correction]], [[Holm-Bonferroni method|Holm-Bonferroni testing]]).<ref name="kass1980">{{Cite journal |last=Kass |first=G. V. |date=1980 |title=An Exploratory Technique for Investigating Large Quantities of Categorical Data |url=https://www.jstor.org/stable/2986296 |journal=Applied Statistics |volume=29 |issue=2 |pages=119–127 |doi=10.2307/2986296|jstor=2986296 |url-access=subscription }}</ref><ref name=":0">{{Cite journal |last1=Biggs |first1=David |last2=De Ville |first2=Barry |last3=Suen |first3=Ed |date=1991 |title=A method of choosing multiway partitions for classification and decision trees |url=https://www.tandfonline.com/doi/full/10.1080/02664769100000005 |journal=Journal of Applied Statistics |language=en |volume=18 |issue=1 |pages=49–62 |doi=10.1080/02664769100000005 |bibcode=1991JApSt..18...49B |issn=0266-4763|url-access=subscription }}</ref>
 
==History==
CHAID is based on a formal extension of AID (Automatic Interaction Detection)<ref name="morgan1963">{{Cite journal |last1=Morgan |first1=James N. |last2=Sonquist |first2=John A. |date=1963 |title=Problems in the Analysis of Survey Data, and a Proposal |url=http://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500855 |journal=Journal of the American Statistical Association |language=en |volume=58 |issue=302 |pages=415–434 |doi=10.1080/01621459.1963.10500855 |issn=0162-1459|url-access=subscription }}</ref> and THAID (THeta Automatic Interaction Detection)<ref name="messenger1972">{{Cite journal |last1=Messenger |first1=Robert |last2=Mandell |first2=Lewis |date=1972 |title=A Modal Search Technique for Predictive Nominal Scale Multivariate Analysis |url=http://www.tandfonline.com/doi/abs/10.1080/01621459.1972.10481290 |journal=Journal of the American Statistical Association |language=en |volume=67 |issue=340 |pages=768–772 |doi=10.1080/01621459.1972.10481290 |issn=0162-1459|url-access=subscription }}</ref><ref name="morgan1973">{{Cite book |last=Morgan |first=James N. |title=THAID, a sequential analysis program for the analysis of nominal scale dependent variables |date=1973 |others=Robert C. Messenger |isbn=0-87944-137-2 |location=Ann Arbor, Mich. |oclc=666930}}</ref> procedures of the 1960s and 1970s, which in turn were extensions of earlier research, including that performed by Belson in the UK in the 1950s.<ref>{{Cite journal |last=Belson |first=William A. |date=1959 |title=Matching and Prediction on the Principle of Biological Classification |url=https://www.jstor.org/stable/2985543 |journal=Applied Statistics |volume=8 |issue=2 |pages=65–75 |doi=10.2307/2985543|jstor=2985543 |url-access=subscription }}</ref>
 
In 1975, the CHAID technique itself was developed in South Africa. It was published in 1980 by Gordon V. Kass, who had completed a PhD thesis on the topic.<ref name="kass1980"/>
 
A history of earlier supervised tree methods can be found in [[Gilbert Ritschard|Ritschard]], including a detailed description of the original CHAID algorithm and the exhaustive CHAID extension by Biggs, De Ville, and Suen.<ref name=":0" /><ref name=":1">{{Cite journal |last=Ritschard |first=Gilbert |title=CHAID and Earlier Supervised Tree Methods |url=https://www.researchgate.net/publication/315476407 |journal=Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences, McArdle, J.J. And G. Ritschard (Eds) |location=New York |publisher=Routledge |publication-date=2013 |pages=48–74}}</ref>
CHAID detects interaction between variables in the [[data set]]. Using this technique, it is possible to establish relationships between a "dependent variable" (for example, readership of a certain newspaper) and explanatory variables such as price, size, and supplements. CHAID does this by identifying discrete groups of respondents and using their responses to the explanatory variables to predict the impact on the dependent variable.
 
CHAID is often used as an exploratory data mining technique and as an alternative to multiple linear regression and logistic regression, especially when the data set is not well-suited to regression analysis. It is based on multiway splitting to create discrete groups and understand their impact on the dependent variable. CHAID was preferred for analysis because of five major criteria:
 
1. A good proportion of the input data was categorical;

2. Its efficiency on large datasets;

3. Its highly visual output and ease of interpretation;

4. Ease of implementing and integrating the business rules generated from CHAID; and

5. Its efficient handling of input data quality issues.<ref>{{Cite web |last=Behera, Desik |first= |date=Nov 2012 |title=Acquiring Insurance Customer: The CHAID Way |url=https://www.researchgate.net/publication/256038754_Acquiring_Insurance_Customer_The_CHAID_Way |access-date=7 Aug 2025 |website=Research Gate}}</ref><ref>{{Cite web |last=Kotane |first=Inta |date=September 2024 |title=APPLICATION OF CHAID DECISION TREES AND NEURAL NETWORKS METHODS IN FORECASTING THE YIELD OF CEREAL INDUSTRY COMPANIES |url=https://www.researchgate.net/publication/383956028_APPLICATION_OF_CHAID_DECISION_TREES_AND_NEURAL_NETWORKS_METHODS_IN_FORECASTING_THE_YIELD_OF_CEREAL_INDUSTRY_COMPANIES |url-status=live |archive-url= |archive-date= |access-date=7 August 2025 |website=Research Gate |doi=10.17770/het2024.28.8264}}</ref>
 
==Properties==
CHAID can be used for prediction (in a similar fashion to [[regression analysis]], this version of CHAID being originally known as XAID) as well as classification, and for detection of interaction between variables.<ref name="morgan1963"/><ref name="messenger1972"/><ref name="morgan1973"/>
 
In practice, CHAID is often used in the context of [[direct marketing]] to select groups of consumers and predict how their responses to some variables affect other variables, although other early applications were in the fields of medical and psychiatric research.{{fact|date=December 2024}}
 
As with other decision trees, CHAID's advantages are that its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis.{{fact|date=December 2024}}
 
One important advantage of CHAID over alternatives such as multiple regression is that it is non-parametric.{{fact|date=December 2024}}
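
The adjusted significance testing that CHAID relies on can be illustrated with a short sketch. The Python example below is a simplified, hypothetical illustration and not Kass's full algorithm: it tests each categorical predictor against the target with a Pearson chi-squared test, applies a plain Bonferroni multiplier equal to the number of candidate predictors, and selects the predictor with the smallest adjusted p-value. The category-merging step of CHAID is omitted, and the function names (<code>chi2_sf</code>, <code>chi2_test</code>, <code>best_split</code>) and the toy marketing data are assumptions made for the example.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction sum.
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 * x / math.pi) * math.exp(-half) * total)


def chi2_test(xs, ys):
    """Pearson chi-squared test of independence for two categorical sequences."""
    n = len(xs)
    row, col, obs = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r, nr in row.items():
        for c, nc in col.items():
            expected = nr * nc / n
            stat += (obs[(r, c)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, p_value


def best_split(data, predictors, target, alpha=0.05):
    """Pick the predictor with the smallest Bonferroni-adjusted p-value.

    Assumption: the multiplier is simply the number of predictors tested;
    the original CHAID multiplier also accounts for category merging.
    """
    ys = [rec[target] for rec in data]
    adjusted = {}
    for name in predictors:
        _, p = chi2_test([rec[name] for rec in data], ys)
        adjusted[name] = min(1.0, p * len(predictors))
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted


# Hypothetical toy data: response depends on channel, not on gender.
counts = {
    ("mail", "f", "yes"): 9, ("mail", "m", "yes"): 9,
    ("mail", "f", "no"): 1, ("mail", "m", "no"): 1,
    ("phone", "f", "yes"): 2, ("phone", "m", "yes"): 2,
    ("phone", "f", "no"): 8, ("phone", "m", "no"): 8,
}
data = [
    {"channel": c, "gender": g, "responded": r}
    for (c, g, r), n in counts.items()
    for _ in range(n)
]
best, adjusted = best_split(data, ["channel", "gender"], "responded")
print(best, adjusted)
```

On this toy data the sketch selects <code>channel</code>, because <code>gender</code> is independent of the response by construction. A fuller implementation would also merge statistically similar categories of each predictor before comparing predictors and would then recurse into each resulting group.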
 
== See also ==
*[[Bonferroni correction]]
*[[Chi-squared distribution]]
*[[Decision tree learning]]
*[[Latent class model]]
*[[Market segment]]
*[[Multiple comparisons]]
*[[Structural equation modeling]]
 
==References==
{{reflist|1}}
 
 
 
==Bibliography==
* Press, Laurence I.; Rogers, Miles S.; & Shure, Gerald H.; ''An interactive technique for the analysis of multivariate data'', Behavioral Science, Vol. 14 (1969), pp.&nbsp;364–370
* Hawkins, Douglas M.; & Kass, Gordon V.; ''Automatic Interaction Detection'', in Hawkins, Douglas M. (ed), ''Topics in Applied Multivariate Analysis'', Cambridge University Press, Cambridge, 1982, pp.&nbsp;269–302
* Hooton, Thomas M.; Haley, Robert W.; Culver, David H.; White, John W.; Morgan, W. Meade; & Carroll, Raymond J.; ''The Joint Associations of Multiple Risk Factors with the Occurrence of Nosocomial Infections'', American Journal of Medicine, Vol. 70, (1981), pp.&nbsp;960–970
* Brink, Susanne; & Van Schalkwyk, Dirk J.; ''Serum ferritin and mean corpuscular volume as predictors of bone marrow iron stores'', South African Medical Journal, Vol. 61, (1982), pp.&nbsp;432–434
* McKenzie, Dean P.; McGorry, Patrick D.; Wallace, Chris S.; Low, Lee H.; Copolov, David L.; & Singh, Bruce S.; ''Constructing a Minimal Diagnostic Decision Tree'', Methods of Information in Medicine, Vol. 32 (1993), pp.&nbsp;161–166
* Magidson, Jay; ''The CHAID approach to segmentation modeling: chi-squared automatic interaction detection'', in Bagozzi, Richard P. (ed); ''Advanced Methods of Marketing Research'', Blackwell, Oxford, GB, 1994, pp.&nbsp;118–159
* Hawkins, Douglas M.; Young, S. S.; & Rosinko, A.; ''Analysis of a large structure-activity dataset using recursive partitioning'', Quantitative Structure-Activity Relationships, Vol. 16, (1997), pp.&nbsp;296–302
 
==External links==
* Luchman, J.N.; ''CHAID: Stata module to conduct chi-square automated interaction detection'', Available for free [https://ideas.repec.org/c/boc/bocode/s457752.html download], or type within Stata: ssc install chaid.
* Luchman, J.N.; ''CHAIDFOREST: Stata module to conduct random forest ensemble classification based on chi-square automated interaction detection (CHAID) as base learner'', Available for free [https://ideas.repec.org/c/boc/bocode/s457932.html download], or type within Stata: ssc install chaidforest.
* [https://www.ibm.com/downloads/cas/Z6XD69WQ IBM SPSS Decision Trees] grows exhaustive CHAID trees as well as a few other types of trees such as CART.
* An R package ''[https://r-forge.r-project.org/R/?group_id=343 CHAID]'' is available on R-Forge.
 
[[Category:Market research]]
[[Category:Market segmentation]]
[[Category:Statistical algorithms]]
[[Category:Statistical classification]]
[[Category:Decision trees]]
[[Category:Classification algorithms]]
 
[[de:CHAID]]
[[fr:CHAID]]