Multivariate adaptive regression spline: Difference between revisions

Content deleted Content added
Monkbot (talk | contribs)
m Task 18 (cosmetic): eval 2 templates: del empty params (2×);
Link suggestions feature: 3 links added.
 
(18 intermediate revisions by 15 users not shown)
Line 1:
{{Short description|Non-parametric regression technique}}
In [[statistics]], '''multivariate adaptive regression splines''' ('''MARS''') is a form of [[regression analysis]] introduced by [[Jerome H. Friedman]] in 1991.<ref>{{Cite journal | last1 = Friedman | first1 = J. H. | doi = 10.1214/aos/1176347963 | title = Multivariate Adaptive Regression Splines | journal = The Annals of Statistics | volume = 19 | issue = 1 | pages = 1–67 | year = 1991 |mr=1091842 | zbl = 0765.62064 | jstor = 2241837| citeseerx = 10.1.1.382.970 }}</ref> It is a [[non-parametric regression]] technique and can be seen as an extension of [[linear model]]s that automatically models nonlinearities and interactions between variables.
 
Line 83 ⟶ 84:
Each [[basis function]] <math>B_i(x)</math> takes one of the following three forms:
 
1) a constant 1. There is just one such term, [[The Intercept|the intercept]].
In the ozone formula above, the intercept term is 5.2.
 
Line 93 ⟶ 94:
 
== Hinge functions ==
 
[[File:Friedmans mars hinge functions.png|frame|right|A mirrored pair of hinge functions with a knot at x=3.1]]
{{further|Hinge function}}
 
Hinge functions are aA key part of MARS models. Aare ''hinge functionfunctions'' takestaking the form
: <math>\max(0,x-c)</math>
or
Line 135 ⟶ 136:
3) all values of each variable (for the knot of the new hinge function).
 
To calculate the coefficient of each term, MARS applies a linear regression over the terms.
 
This process of adding terms continues until the change in residual error is too small to continue or until the maximum number of terms is reached. The maximum number of terms is specified by the user before model building starts.
 
The search at each step is usually done in a [[Brute-force search|brute-force]] fashion, but a key aspect of MARS is that because of the nature of hinge functions, the search can be done relatively quickly using a fast least-squares update technique. Actually, the search is not quite brute Brute-force. The search can be sped up withby using a [[Heuristics|heuristic]] that reduces the number of parent terms to considerconsidered at each step ("Fast MARS"<ref>[[Friedman, J. H.]] (1993) ''Fast MARS'', Stanford University Department of Statistics, Technical Report 110</ref>).
 
=== The backward pass ===
 
The forward pass usually builds an [[overfit|overfits]] model. (An overfit model has a good fit to the data used to build the model but will not generalize well to new data.) To build a model with better generalization ability, the backward pass prunes the model. It removes terms one by one, deleting the least effective term at each step until it finds the best submodel. Model subsets are compared using the Generalized cross validation (GCV) criterion described below.
 
The backward pass has an advantage over the forward pass: at any step it can choose any term to delete, whereas the forward pass at each step can only see the next pair of terms.
Line 150 ⟶ 151:
 
==== Generalized cross validation ====
{{further|Cross-validation (statistics)|Model selection|Akaike information criterion}}
 
The backward pass compares the performance of different models using Generalized Cross-Validation (GCV), a minor variant on the [[Akaike information criterion]] that approximates the [[leave-one-out cross-validation]] score in the special case where errors are Gaussian, or where the squared error [[loss function]] is used. GCV was introduced by Craven and [[Grace Wahba|Wahba]] and extended by Friedman for MARS; lower values of GCV indicate better models. The formula for the GCV is
The backward pass uses generalized cross validation (GCV) to compare the performance of model subsets in order to choose the best subset: lower values of GCV are better. The GCV is a form of [[Regularization (machine learning)|regularization]]: it trades off goodness-of-fit against model complexity.
 
(We want to estimate how well a model performs on ''new'' data, not on the training data. Such new data is usually not available at the time of model building, so instead we use GCV to estimate what performance would be on new data. The raw [[Residual sum of squares|residual sum-of-squares]] (RSS) on the training data is inadequate for comparing models, because the RSS always increases as MARS terms are dropped. In other words, if the RSS were used to compare models, the backward pass would always choose the largest model—but the largest model typically does not have the best generalization performance.)
 
The formula for the GCV is
 
: GCV = RSS / (''N'' · (1 − (effective number of parameters) / ''N'')<sup>2</sup>)
Line 162 ⟶ 159:
where RSS is the residual sum-of-squares measured on the training data and ''N'' is the number of observations (the number of rows in the '''x''' matrix).
 
The ''EffectiveNumberOfParameters''effective number of parameters is defined inas
the MARS context as
 
: (effective number of parameters) = (number of mars terms) + (penalty) · ((number of Mars terms) − 1 ) / 2
 
where '''penalty''' is abouttypically 2 or(giving 3results equivalent to (the MARS[[Akaike softwareinformation allowscriterion]]) but can be increased by the user toif presetthey penalty)so desire.
 
Note that
Line 173 ⟶ 169:
: (number of Mars terms − 1 ) / 2
 
is the number of hinge-function knots, so the formula penalizes the addition of knots. Thus the GCV formula adjusts (i.e. increases) the training RSS to takepenalize intomore accountcomplex the flexibility of the modelmodels. We penalize flexibility because models that are too flexible will model the specific realization of noise in the data instead of just the systematic structure of the data.
 
Generalized cross-validation is so named because it uses a formula to approximate the error that would be determined by leave-one-out validation. It is just an approximation but works well in practice. GCVs were introduced by Craven and [[Grace Wahba|Wahba]] and extended by Friedman for MARS.
 
=== Constraints ===
Line 197 ⟶ 191:
 
== Pros and cons ==
*MARS models are simple to understand and interpret.<ref name=":0">{{Cite book|title=Applied Predictive Modeling|lastlast1=Kuhn|firstfirst1=Max|last2=Johnson|first2=Kjell|date=2013|publisher=Springer New York|isbn=9781461468486|___location=New York, NY|language=en|doi=10.1007/978-1-4614-6849-3}}</ref> Compare the equation for ozone concentration above to, say, the innards of a trained [[Artificial neural network|neural network]] or a [[random forest]].
{{original research|date=October 2016}}
*MARS can handle both continuous and [[categorical data]].<ref>{{cite book | last=Friedman | first=Jerome H. | chapter=Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines | author-link=Friedman, J. H.|year=1993|title=New Directions in Statistical Data Analysis and Robustness |editor=Stephan Morgenthaler |editor2=Elvezio Ronchetti |editor3=Werner Stahel|publisher=Birkhauser}}</ref><ref name="Friedman 1991">{{cite journal | last=Friedman | first=Jerome H. | title=Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines | website=DTIC | date=1991-06-01 | url=https://apps.dtic.mil/sti/citations/ADA590939 | archive-url=https://web.archive.org/web/20220411085148/https://apps.dtic.mil/sti/citations/ADA590939 | url-status=live | archive-date=April 11, 2022 | access-date=2022-04-11}}</ref>
 
*MARS (like recursive partitioning) does automatic [[Feature selection|variable selection]] (meaning it includes important variables in the model and excludes unimportant ones). However, there can be some arbitrariness in the selection, especially when there are correlated predictors, and this can affect interpretability.<ref name=":0" />
No regression modeling technique is best for all situations.
*Building MARS models often requires little or no data preparation.<ref name=":0" />
The guidelines below are intended to give an idea of the pros and cons of MARS,
* [https://web.stat.tamu.edu/~bmallick/wileybook/book_code.html Code] from the book ''Bayesian Methods for Nonlinear Classification and Regression''<ref>{{cite book |last1=Denison |first1=D. G. T. |last2=Holmes |first2=C. C. |last3=Mallick |first3=B. K. |last4=Smith |first4=A. F. M. |title=Bayesian methods for nonlinear classification and regression |date=2002 |publisher=Wiley |___location=Chichester, England |isbn=978-0-471-49036-4}}</ref> for Bayesian MARS.
but there will be exceptions to the guidelines.
It is useful to compare MARS to [[recursive partitioning]] and this is done below.
(Recursive partitioning is also commonly called ''regression trees'',
''decision trees'', or [[Predictive analytics#Classification and regression trees|CART]];
see the [[Decision tree learning|recursive partitioning]] article for details).
 
*MARS models are more flexible than [[linear regression]] models.
*MARS models are simple to understand and interpret.<ref name=":0">{{Cite book|title=Applied Predictive Modeling|last=Kuhn|first=Max|last2=Johnson|first2=Kjell|date=2013|publisher=Springer New York|isbn=9781461468486|___location=New York, NY|language=en|doi=10.1007/978-1-4614-6849-3}}</ref> Compare the equation for ozone concentration above to, say, the innards of a trained [[Artificial neural network|neural network]] or a [[random forest]].
*MARS can handle both continuous and categorical data.<ref>[[Friedman, J. H.]] (1993) ''Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines'', New Directions in Statistical Data Analysis and Robustness (Morgenthaler, Ronchetti, Stahel, eds.), Birkhauser</ref> MARS tends to be better than recursive partitioning for numeric data because hinges are more appropriate for numeric variables than the piecewise constant segmentation used by recursive partitioning.
*Building MARS models often requires little or no data preparation<ref name=":0" />. The hinge functions automatically partition the input data, so the effect of outliers is contained. In this respect MARS is similar to [[recursive partitioning]] which also partitions the data into disjoint regions, although using a different method. (Nevertheless, as with most statistical modeling techniques, known outliers should be considered for removal before training a MARS model.{{Citation needed|date=March 2019}})
*MARS (like recursive partitioning) does automatic [[Feature selection|variable selection]] (meaning it includes important variables in the model and excludes unimportant ones). However, there can be some arbitrariness in the selection, especially when there are correlated predictors, and this can affect interpretability<ref name=":0" />
*MARS models tend to have a good bias-variance trade-off. The models are flexible enough to model non-linearity and variable interactions (thus MARS models have fairly low bias), yet the constrained form of MARS basis functions prevents too much flexibility (thus MARS models have fairly low variance).
*MARS is suitable for handling fairly large datasets. It is a routine matter to build a MARS model from an input matrix with, say, 100 predictors and 10<sup>5</sup> observations. Such a model can be built in about a minute on a 1&nbsp;GHz machine, assuming the maximum degree of interaction of MARS terms is limited to one (i.e. additive terms only). A degree two model with the same data on the same 1&nbsp;GHz machine takes longer—about 12 minutes. Be aware that these times are highly data dependent. Recursive partitioning is much faster than MARS.{{Citation needed|date=March 2019}}
*With MARS models, as with any non-parametric regression, parameter confidence intervals and other checks on the model cannot be calculated directly (unlike [[linear regression]] models). [[Cross-validation (statistics)|Cross-validation]] and related techniques must be used for validating the model instead.
*MARS models do not give as good fits as [[Boosting (meta-algorithm)|boosted]] trees, but can be built much more quickly and are more interpretable. (An 'interpretable' model is in a form that makes it clear what the effect of each predictor is.)
*The <code>earth</code>, <code>mda</code>, and <code>polspline</code> implementations do not allow missing values in predictors, but free implementations of regression trees (such as <code>rpart</code> and <code>party</code>) do allow missing values using a technique called surrogate splits.
*MARS models can make predictions quickly. The prediction function simply has to evaluate the MARS model formula. Compare that to making a prediction with say a [[Support Vector Machine]], where every variable has to be multiplied by the corresponding element of every support vector. That can be a slow process if there are many variables and many support vectors.
*The resulting fitted function is not smooth (not differentiable along hinges).
 
== Extensions and related concepts ==
* [[Generalized linear model]]s (GLMs) can be incorporated into MARS models by applying a link function after the MARS model is built. Thus, for example, MARS models can incorporate [[logistic regression]] to predict probabilities.
* [[Nonlinear regression|Non-linear regression]] is used when the underlying form of the function is known and regression is used only to estimate the parameters of that function. MARS, on the other hand, estimates the functions themselves, albeit with severe constraints on the nature of the functions. (These constraints are necessary because discovering a model from the data is an [[inverse problem]] that is not [[Well-posed problem|well-posed]] without constraints on the model.)
* [[Recursive partitioning]] (commonly called CART). MARS can be seen as a generalization of recursive partitioning that allows thefor modelcontinuous tomodels, betterwhich handlecan numericalprovide (i.e.a non-categorical)better fit for numerical data.
* [[Generalized additive model]]s. FromUnlike the user's perspectiveMARS, GAMs are similar to MARS but (a) fit smooth [[Local regression|loess]] or polynomial [[Spline (mathematics)|splines]] insteadrather ofthan MARS basishinge functions, and (b)they do not automatically model variable interactions. The fittingsmoother methodfit usedand internallylack byof GAMsregression isterms veryreduces differentvariance fromwhen thatcompared ofto MARS., but For models that do not require automatic discovery ofignoring variable interactions GAMscan oftenworsen competethe favorably with MARSbias.
* [[TSMARS]]. Time Series Mars is the term used when MARS models are applied in a [[time series]] context. Typically in this set up the predictors are the lagged time series values resulting in autoregressive spline models. These models and extensions to include moving average spline models are described in "Univariate Time Series Modelling and Forecasting using TSMARS: A study of threshold time series autoregressive, seasonal and moving average models using TSMARS".
* [[Bayesian MARS]] (BMARS) uses the same model form, but builds the model using a Bayesian approach. It may arrive at different optimal MARS models because the model building approach is different. The result of BMARS is typically an ensemble of posterior samples of MARS models, which allows for probabilistic prediction.<ref>{{cite journal |last1=Denison |first1=D. G. T. |last2=Mallick |first2=B. K. |last3=Smith |first3=A. F. M. |title=Bayesian MARS |journal=Statistics and Computing |date=1 December 1998 |volume=8 |issue=4 |pages=337–346 |doi=10.1023/A:1008824606259 |s2cid=12570055 |url=https://link.springer.com/content/pdf/10.1023/A:1008824606259.pdf |language=en |issn=1573-1375}}</ref>
 
== See also ==
Line 245 ⟶ 223:
* Berk R.A. (2008) ''Statistical learning from a regression perspective'', Springer, {{ISBN|978-0-387-77500-5}}
 
== External links ==
{{external cleanup|date=October 2016}}
Several free and commercial software packages are available for fitting MARS-type models.
 
; Free software:
* [[R (programming language)|R]] packages:
** <code>earth</code> function in the <code>[https://cran.r-project.org/web/packages/earth/index.html earth]</code> package
** <code>mars</code> function in the <code>[https://cran.r-project.org/web/packages/mda/index.html mda]</code> package
** <code>polymars</code> function in the <code>[https://cran.r-project.org/web/packages/polspline/index.html polspline]</code> package. Not Friedman's MARS.
* Matlab code:
** [http://www.cs.rtu.lv/jekabsons/regression.html ARESLab: Adaptive Regression Splines toolbox for Matlab]
* Python
** [http://orange.biolab.si/blog/2011/12/20/earth-multivariate-adaptive-regression-splines/ Earth – Multivariate adaptive regression splines]
** [https://github.com/jcrudy/py-earth/ py-earth]
 
; Commercial software:
* [http://www.salford-systems.com/mars.php MARS] from Salford Systems. Based on Friedman's implementation.
* [https://web.archive.org/web/20101203023609/http://www.statsoft.com/products/data-mining-solutions/ STATISTICA Data Miner] from StatSoft
* [http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_adaptivereg_overview.htm ADAPTIVEREG] from SAS.
 
[[Category:Nonparametric regression]]