{{short description|Moving average and polynomial regression method for smoothing data}}
{{More footnotes|date=June 2011}}
[[Image:Loess curve.svg|thumb|LOESS curve fitted to a population sampled from a [[sine wave]] with uniform noise added. The LOESS curve approximates the original sine wave.]]
{{Regression bar}}
LOESS and LOWESS thus build on [[classical statistics|"classical" methods]], such as linear and nonlinear [[least squares regression]]. They address situations in which the classical procedures do not perform well or cannot be effectively applied without undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of [[Non-linear regression|nonlinear regression]]. It does this by fitting simple models to localized subsets of the data to build up a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not required to specify a global function of any form to fit a model to the data, only to fit segments of the data.
 
The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would have been practically impossible to use in the era when least squares regression was being developed. Most other modern methods for process modelling are similar to LOESS in this respect. These methods have been consciously designed to use our current computational ability to the fullest possible advantage to achieve goals not easily achieved by traditional approaches.
 
A smooth curve through a set of data points obtained with this statistical technique is called a '''loess curve''', particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the ''y''-axis [[scattergram]] criterion variable. When each smoothed value is given by a weighted linear least squares regression over the span, this is known as a '''lowess curve'''. However, some authorities treat '''lowess''' and '''loess''' as synonyms.<ref>Kristen Pavlik, US Environmental Protection Agency, ''[https://19january2021snapshot.epa.gov/sites/static/files/2016-07/documents/loess-lowess.pdf Loess (or Lowess)]'', '''Nutrient Steps''', July 2016.</ref><ref name="NIST"/>
 
==History==
 
Local regression and closely related procedures have a long and rich history, having been discovered and rediscovered in different fields on multiple occasions. An early work by [[Robert Henderson (mathematician)|Robert Henderson]]<ref>Henderson, R. Note on Graduation by Adjusted Average. ''Transactions of the Actuarial Society of America'' 17, 43–48. [https://archive.org/details/transactions17actuuoft archive.org]</ref> studying the problem of graduation (a term for smoothing used in the actuarial literature) introduced local regression using cubic polynomials, and showed how earlier graduation methods could be interpreted as local polynomial fitting.
 
Specifically, let <math>Y_j</math> denote an ungraduated sequence of observations. Following Henderson, suppose that only the terms from <math>Y_{-h}</math> to <math>Y_h</math> are to be taken into account when computing the graduated value of <math>Y_0</math>, and <math>W_j</math> is the weight to be assigned to <math>Y_j</math>. Henderson then uses a local polynomial approximation <math>a + b j + c j^2 + d j^3</math>, and sets up the following four equations for the coefficients:
:<math>
\begin{align}
\sum_{j=-h}^h ( a + b j + c j^2 + d j^3) W_j &= \sum_{j=-h}^h W_j Y_j \\
\sum_{j=-h}^h ( aj + b j^2 + c j^3 + d j^4) W_j &= \sum_{j=-h}^h j W_j Y_j \\
\sum_{j=-h}^h ( aj^2 + b j^3 + c j^4 + d j^5) W_j &= \sum_{j=-h}^h j^2 W_j Y_j \\
\sum_{j=-h}^h ( aj^3 + b j^4 + c j^5 + d j^6) W_j &= \sum_{j=-h}^h j^3 W_j Y_j
\end{align}
</math>
Solving these equations for the polynomial coefficients yields the graduated value, <math>\hat Y_0 = a</math>.
 
Henderson went further. In preceding years, many 'summation formula' methods of graduation had been developed, which derived graduation rules based on summation formulae (convolution of the series of observations with a chosen set of weights). Two such rules are the 15-point and 21-point rules of [[John Spencer (Actuary)|Spencer]] (1904).<ref>{{citeQ|Q127775139}}</ref> These graduation rules were carefully designed to have a quadratic-reproducing property: if the ungraduated values exactly follow a quadratic formula, then the graduated values equal the ungraduated values. This is an important property: a simple moving average, by contrast, cannot adequately model peaks and troughs in the data. Henderson's insight was to show that ''any'' such graduation rule can be represented as a local cubic (or quadratic) fit for an appropriate choice of weights.
 
Further discussions of the historical work on graduation and local polynomial fitting can be found in [[Frederick Macaulay|Macaulay]] (1931),<ref name="mac1931">{{cite Q|Q134465853}}</ref> [[William S. Cleveland|Cleveland]] and [[Catherine Loader|Loader]] (1995),<ref name="slrpm">{{cite Q|Q132138257}}</ref> and [[Lori Murray|Murray]] and [[David Bellhouse (statistician)|Bellhouse]] (2019).<ref>{{cite Q|Q127772934}}</ref>
 
The [[Savitzky–Golay filter]], introduced by [[Abraham Savitzky]] and [[Marcel J. E. Golay]] (1964),<ref>{{cite Q|Q56769732}}</ref> significantly expanded the method. Like the earlier graduation work, their focus was on data with an equally spaced predictor variable, where (excluding boundary effects) local regression can be represented as a [[convolution]]. Savitzky and Golay published extensive sets of convolution coefficients for different orders of polynomial and smoothing window widths.
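The correspondence between local polynomial fitting and convolution can be made concrete with a short computation. Below is a minimal sketch in R (the language of the implementations discussed later in this article); <code>sg_weights</code> is an illustrative helper name, and the final line reproduces the classic 5-point quadratic coefficients (−3, 12, 17, 12, −3)/35 published by Savitzky and Golay:
<syntaxhighlight lang="r">
# Convolution weights for an equally spaced window j = -h, ..., h:
# fit a polynomial of the given degree by least squares and read off
# the fitted value at j = 0, which is a linear combination of the data.
sg_weights <- function(h, degree = 2) {
  j <- -h:h
  X <- outer(j, 0:degree, `^`)   # design matrix with columns 1, j, j^2, ...
  # The fitted value at j = 0 is the intercept estimate; its weights on
  # the observations form the first row of (X'X)^{-1} X'.
  solve(t(X) %*% X, t(X))[1, ]
}

sg_weights(2) * 35   # -3 12 17 12 -3: the classic 5-point quadratic filter
</syntaxhighlight>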
 
Local regression methods started to appear extensively in statistics literature in the 1970s; for example, [[Charles Joel Stone|Charles J. Stone]] (1977),<ref>{{cite Q|Q56533608}}</ref> [[Vladimir Katkovnik]] (1979)<ref>{{citation|first=Vladimir|last=Katkovnik|title=Linear and nonlinear methods of nonparametric regression analysis|journal=Soviet Automatic Control|date=1979|volume=12|issue=5|pages=25–34}}</ref> and [[William S. Cleveland]] (1979).<ref name="cleve79">{{cite Q|Q30052922}}</ref> Katkovnik (1985)<ref name="katbook">{{cite Q|Q132129931}}</ref> is the earliest book devoted primarily to local regression methods.
 
Theoretical work continued to appear throughout the 1990s. Important contributions include [[Jianqing Fan]] and [[Irène Gijbels]] (1992),<ref>{{cite Q|Q132202273}}</ref> studying efficiency properties, and [[David Ruppert]] and [[Matthew P. Wand]] (1994),<ref>{{cite Q|Q132202598}}</ref> developing an asymptotic distribution theory for multivariate local regression.
 
An important extension of local regression is Local Likelihood Estimation, formulated by [[Robert Tibshirani]] and [[Trevor Hastie]] (1987).<ref name="tib-hast-lle">{{cite Q|Q132187702}}</ref> This replaces the local least-squares criterion with a likelihood-based criterion, thereby extending the local regression method to the [[generalized linear model]] setting; for example, binary, count, or censored data.
 
Practical implementations of local regression began appearing in statistical software in the 1980s. Cleveland (1981)<ref>{{cite Q|Q29541549}}</ref> introduces the LOWESS routines, intended for smoothing scatterplots. This implements local linear fitting with a single predictor variable, and also introduces robustness downweighting to make the procedure resistant to outliers. An entirely new implementation, LOESS, is described in Cleveland and [[Susan J. Devlin]] (1988).<ref name="clevedev">{{cite Q|Q29393395}}</ref> LOESS is a multivariate smoother, able to handle spatial data with two (or more) predictor variables, and uses (by default) local quadratic fitting. Both LOWESS and LOESS are implemented in the [[S (programming language)|S]] and [[R (programming language)|R]] programming languages. See also Cleveland's Local Fitting Software.<ref>{{cite web|last=Cleveland|first=William|title=Local Fitting Software|url=http://www.stat.purdue.edu/~wsc/localfitsoft.html|archive-url=https://web.archive.org/web/20050912090738/http://www.stat.purdue.edu/~wsc/localfitsoft.html|archive-date=12 September 2005}}</ref>
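For illustration, a minimal usage sketch of both base-R routines on simulated data (the sine-plus-noise setup mirrors the figure at the top of this article):
<syntaxhighlight lang="r">
# lowess: local linear fitting with robustness iterations.
# loess:  local quadratic fitting by default, formula interface.
set.seed(1)
x <- sort(runif(200, 0, 2 * pi))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_lowess <- lowess(x, y, f = 2/3, iter = 3)        # f is the span
fit_loess  <- loess(y ~ x, span = 0.75, degree = 2)  # default settings

plot(x, y)
lines(fit_lowess, col = "blue")
lines(x, predict(fit_loess), col = "red")
</syntaxhighlight>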
 
While the terms local regression, LOWESS, and LOESS are sometimes used interchangeably, this usage should be considered incorrect. Local regression is a general term for the fitting procedure; LOWESS and LOESS are two distinct implementations.
 
==Model definition==
For ease of presentation, the development below assumes a single predictor variable; the extension to multiple predictors (when the <math>x_i</math> are vectors) is conceptually straightforward. A functional relationship between the predictor and response variables is assumed:
<math display="block">Y_i = \mu(x_i) + \epsilon_i</math>
where <math>\mu(x)</math> is the unknown ‘smooth’ regression function to be estimated, and represents the conditional expectation of the response, given a value of the predictor variables. In theoretical work, the ‘smoothness’ of this function can be formally characterized by placing bounds on higher order derivatives. The <math>\epsilon_i</math> represent random error; for estimation purposes these are assumed to have [[mean]] zero. Stronger assumptions (e.g., [[independence (probability theory)|independence]] and equal [[variance]]) may be made when assessing properties of the estimates.
 
Local regression then estimates the function <math>\mu(x)</math>, for one value of <math>x</math> at a time. Since the function is assumed to be smooth, the most informative data points are those whose <math>x_i</math> values are close to <math>x</math>. This is formalized with a bandwidth <math>h</math> and a [[kernel (statistics)|kernel]] or weight function <math>W(\cdot)</math>, with observations assigned weights
<math display="block">w_i(x) = W\left(\frac{x_i - x}{h}\right).</math>
A polynomial of degree <math>p</math> is then fitted by [[weighted least squares]]: the coefficients <math>(\beta_0, \beta_1, \ldots, \beta_p)</math> are chosen to minimize
<math display="block">
\sum_{i=1}^n w_i(x) \left ( Y_i - \beta_0 - \beta_1(x_i-x) - \ldots - \beta_p(x_i-x)^p \right )^2.
</math>
The local regression estimate of <math>\mu(x)</math> is then simply the intercept estimate:
<math display="block">\hat\mu(x) = \hat\beta_0</math>
while the remaining coefficients can be interpreted as local estimates of the derivatives of <math>\mu(x)</math>.

It is to be emphasized that the above procedure produces the estimate <math>\hat\mu(x)</math> for one value of <math>x</math>. When considering a new value of <math>x</math>, a new set of weights <math>w_i(x)</math> must be computed, and the regression coefficient estimated afresh.
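A minimal sketch of this pointwise procedure in R, assuming a constant bandwidth <math>h</math> and the tri-cube weight function discussed later in this article (<code>local_regression</code> is an illustrative name, not a library function):
<syntaxhighlight lang="r">
# Local polynomial regression evaluated over a grid of fitting points.
# At each x0, weights are recomputed and a weighted polynomial refitted.
local_regression <- function(x, y, h, degree = 2,
                             grid = seq(min(x), max(x), length.out = 100)) {
  tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)
  sapply(grid, function(x0) {
    w <- tricube((x - x0) / h)   # weights for this fitting point only
    fit <- lm(y ~ poly(x - x0, degree, raw = TRUE), weights = w)
    unname(coef(fit)[1])         # intercept estimate = muhat(x0)
  })
}

# Caveat: h must be wide enough that each window contains more than
# degree + 1 points with nonzero weight, or the local fit is degenerate.
</syntaxhighlight>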
 
===Matrix representation of the local regression estimate===
 
As with all least squares estimates, the estimated regression coefficients can be expressed in closed form (see [[Weighted least squares]] for details):
<math display="block">\hat\beta = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y},</math>
where <math>\hat\beta</math> is the vector of local regression coefficients, <math>\mathbf{X}</math> is the [[design matrix]] with rows <math>(1, x_i-x, \ldots, (x_i-x)^p)</math>, <math>\mathbf{W}</math> is a diagonal matrix of the smoothing weights <math>w_i(x)</math>, and <math>\mathbf{y}</math> is the vector of responses.
This matrix representation is crucial for studying the theoretical properties of local regression estimates. With appropriate definitions of the design and weight matrices, it immediately generalizes to the multiple-predictor setting.
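A sketch of this closed form at a single fitting point, in R with tri-cube weights (<code>local_fit_matrix</code> is an ad hoc name; for clarity it builds the diagonal weight matrix explicitly rather than using a numerically preferable factorization):
<syntaxhighlight lang="r">
# Weighted least squares in matrix form: beta = (X'WX)^{-1} X'Wy.
local_fit_matrix <- function(x, y, x0, h, degree = 2) {
  w <- pmax(1 - abs((x - x0) / h)^3, 0)^3   # tri-cube weights
  X <- outer(x - x0, 0:degree, `^`)         # rows (1, xi - x0, ..., (xi - x0)^p)
  W <- diag(w)
  beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
  beta[1]                                   # intercept = muhat(x0)
}
</syntaxhighlight>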
 
==Selection issues: bandwidth, local model, fitting criteria==
 
Implementation of local regression requires specification and selection of several components:
# The bandwidth <math>h</math>, or smoothing window.
# The degree of local polynomial, or more generally, the form of the local model.
# The choice of weight function <math>W(\cdot)</math>.
# The choice of fitting criterion (least squares or something else).
 
Each of these components has been the subject of extensive study; a summary is provided below.
===Bandwidth===

One question not addressed above is, how should the bandwidth depend upon the fitting point <math>x</math>? Often a constant bandwidth is used, while LOWESS and LOESS prefer a nearest-neighbor bandwidth, meaning ''h'' is smaller in regions with many data points. Formally, the smoothing parameter, <math>\alpha</math>, is the fraction of the total number ''n'' of data points that are used in each local fit. The subset of data used in each weighted least squares fit thus comprises the <math>n\alpha</math> points (rounded to the next largest integer) whose explanatory variables' values are closest to the point at which the response is being estimated.<ref name="NIST">NIST, [http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm "LOESS (aka LOWESS)"], section 4.1.4.4, ''NIST/SEMATECH e-Handbook of Statistical Methods,'' (accessed 14 April 2017)</ref>
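A sketch of the nearest-neighbour bandwidth implied by this definition: the bandwidth at <math>x</math> is the distance to the <math>\lceil n\alpha \rceil</math>-th closest observation, so windows widen automatically where data are sparse (<code>nn_bandwidth</code> is an illustrative name):
<syntaxhighlight lang="r">
# Nearest-neighbour bandwidth for smoothing parameter alpha:
# the distance within which the ceiling(n * alpha) nearest x_i fall.
nn_bandwidth <- function(x, x0, alpha) {
  k <- ceiling(length(x) * alpha)
  sort(abs(x - x0))[k]
}
</syntaxhighlight>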
 
More sophisticated methods attempt to choose the bandwidth ''adaptively''; that is, choose a bandwidth at each fitting point <math>x</math> by applying criteria such as cross-validation locally within the smoothing window. An early example of this is [[Jerome H. Friedman]]'s<ref>{{citation|first=Jerome H.|last=Friedman|title=A Variable Span Smoother|date=October 1984|publisher=Technical report, Laboratory for Computational Statistics LCS 5; SLAC PUB-3466|doi=10.2171/1447470|doi-broken-date=1 July 2025|url=http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-3477.pdf}}</ref> "supersmoother", which uses cross-validation to choose among local linear fits at different bandwidths.
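Friedman's supersmoother is available in base R as <code>supsmu()</code>. As a simplified illustration of bandwidth selection by cross-validation, the sketch below chooses one ''global'' span for <code>loess</code> by leave-one-out prediction error; the supersmoother goes further, choosing among spans locally at each point:
<syntaxhighlight lang="r">
# Leave-one-out cross-validation over candidate loess spans.
cv_span <- function(x, y, spans = seq(0.2, 0.9, by = 0.1)) {
  loo_mse <- function(s) {
    pred <- sapply(seq_along(x), function(i) {
      d <- data.frame(xx = x[-i], yy = y[-i])
      fit <- loess(yy ~ xx, data = d, span = s, degree = 1,
                   control = loess.control(surface = "direct"))
      predict(fit, data.frame(xx = x[i]))   # prediction at held-out point
    })
    mean((y - pred)^2)
  }
  spans[which.min(sapply(spans, loo_mse))]
}

# fit <- supsmu(x, y)   # Friedman's variable-span smoother, for comparison
</syntaxhighlight>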
 
===Degree of local polynomials===
Most sources, in both theoretical and computational work, use low-order polynomials as the local model, with polynomial degree ranging from 0 to 3.
 
The degree 0 (local constant) model is equivalent to a [[kernel smoother]], usually credited to [[Èlizbar Nadaraya]] (1964)<ref>{{cite Q|Q29303512}}</ref> and [[G. S. Watson]] (1964).<ref>{{citation|last=Watson|first=G. S.|title=Smooth regression analysis|journal=Sankhya Series A|volume=26|date=1964|pages=359–372}}</ref> This is the simplest model to use, but can suffer from bias when fitting near boundaries of the dataset.
 
Local linear (degree 1) fitting can substantially reduce the boundary bias.
Local quadratic (degree 2) and local cubic (degree 3) fits can result in improved estimates, particularly when the underlying mean function <math>\mu(x)</math> has substantial curvature, or equivalently a large second derivative.
 
In theory, higher orders of polynomial can lead to faster convergence of the estimate <math>\hat\mu(x)</math> to the true mean <math>\mu(x)</math>, ''provided that <math>\mu(x)</math> has a sufficient number of derivatives''. See C. J. Stone (1980).<ref>{{cite Q|Q132272803}}</ref> Generally, it takes a large sample size for this faster convergence to be realized. There are also computational and stability issues that arise, particularly for multivariate smoothing. It is generally not recommended to use local polynomials with degree greater than 3.
 
As with bandwidth selection, methods such as cross-validation can be used to compare the fits obtained with different degrees of polynomial.
===Weight function===

As mentioned above, the weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to follow the local model best influence the local model parameter estimates the most. Points that are less likely to actually conform to the local model have less influence on the local model [[Parameter#Statistics|parameter]] [[Statistical estimation|estimates]].
 
Cleveland (1979)<ref name="cleve79" /> sets out four requirements for the weight function:
# Non-negative: <math>W(x) > 0</math> for <math>|x| < 1</math>.
# Symmetry: <math>W(-x) = W(x)</math>.
# Monotone: <math>W(x)</math> is a nonincreasing function for <math>x \ge 0</math>.
# Bounded support: <math>W(x) = 0</math> for <math>|x| \ge 1</math>.

The weight for a specific point in any localized subset of data is obtained by evaluating the weight function at the distance between that point and the point of estimation, after scaling the distance so that the maximum absolute distance over all of the points in the subset of data is exactly one.<ref name="NIST" />

Asymptotic efficiency of weight functions has been considered by [[V. A. Epanechnikov]] (1969)<ref>{{cite Q|Q57308723}}</ref> in the context of kernel density estimation; J. Fan (1993)<ref>{{cite Q|Q132691957}}</ref> has derived similar results for local regression. They conclude that the quadratic kernel, <math>W(x) = 1-x^2</math> for <math>|x|\le 1</math>, has the greatest efficiency under a mean-squared-error loss function. See [[Kernel (statistics)#Kernel functions in common use|"kernel functions in common use"]] for more discussion of different kernels and their efficiencies.

Considerations other than MSE are also relevant to the choice of weight function. Smoothness properties of <math>W(x)</math> directly affect smoothness of the estimate <math>\hat\mu(x)</math>: in particular, the quadratic kernel is not differentiable at <math>x=\pm 1</math>, and <math>\hat\mu(x)</math> is not differentiable as a result. The traditional [[Kernel (statistics)#Kernel functions in common use|tri-cube weight function]],
<math display="block">W(x) = (1 - |x|^3)^3, \qquad |x| < 1,</math>
has been used in LOWESS and other local regression software; this combines higher-order differentiability with a high MSE efficiency.
 
One criticism of weight functions with bounded support is that they can lead to numerical problems (i.e. an unstable or singular design matrix) when fitting in regions with sparse data. For this reason, some authors{{who|date=April 2025}} choose to use the Gaussian kernel, or others with unbounded support.
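For concreteness, the kernels discussed above are one-line functions in R (names are ad hoc); plotting them makes the bounded support and the non-differentiability of the quadratic kernel at <math>\pm 1</math> visible:
<syntaxhighlight lang="r">
# Common local regression weight functions.
quadratic_kernel <- function(x) ifelse(abs(x) < 1, 1 - x^2, 0)          # Epanechnikov
tricube_kernel   <- function(x) ifelse(abs(x) < 1, (1 - abs(x)^3)^3, 0)
gaussian_kernel  <- function(x) exp(-x^2 / 2)                           # unbounded support

curve(tricube_kernel(x), -1.5, 1.5, ylab = "W(x)")
curve(quadratic_kernel(x), add = TRUE, lty = 2)
curve(gaussian_kernel(x), add = TRUE, lty = 3)
</syntaxhighlight>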
 
===Choice of fitting criterion===
 
As described above, local regression uses a locally weighted least squares criterion to estimate the regression parameters. This inherits many of the advantages (ease of implementation and interpretation; good properties when errors are normally distributed) and disadvantages (sensitivity to extreme values and outliers; inefficiency when errors have unequal variance or are not normally distributed) usually associated with least squares regression.
 
These disadvantages can be addressed by replacing the local least-squares estimation by something else. Two such ideas are presented here: local likelihood estimation, which applies local estimation to the [[generalized linear model]], and robust local regression, which localizes methods from [[robust regression]].
 
====Local likelihood estimation====
 
In local likelihood estimation, developed in Tibshirani and Hastie (1987),<ref name="tib-hast-lle" /> the observations <math>Y_i</math> are assumed to come from a parametric family of distributions, with a known probability density function (or mass function, for discrete data),
<math display="block">
Y_i \sim f(y,\theta(x_i)),
</math>
where the parameter function <math>\theta(x)</math> is the unknown quantity to be estimated. To estimate <math>\theta(x)</math> at a particular point <math>x</math>, the local likelihood criterion is
<math display="block">
\sum_{i=1}^n w_i(x) \log f \left ( Y_i,
\beta_0 + \beta_1(x_i-x) + \ldots + \beta_p (x_i-x)^p \right ).
</math>
Estimates of the regression coefficients (in particular, <math>\hat\beta_0</math>) are obtained by maximizing the local likelihood criterion, and
the local likelihood estimate is
<math display="block">
\hat\theta(x) = \hat\beta_0.
</math>
 
When <math>f(y,\theta(x))</math> is the normal distribution and <math>\theta(x)</math> is the mean function, the local likelihood method reduces to the standard local least-squares regression. For other likelihood families, there is (usually) no closed-form solution for the local likelihood estimate, and iterative procedures such as [[iteratively reweighted least squares]] must be used to compute the estimate.
 
''Example'' (local logistic regression). All response observations are 0 or 1, and the mean function is the "success" probability, <math>\mu(x_i) = \Pr (Y_i=1 | x_i)</math>. Since <math>\mu(x_i)</math> must be between 0 and 1, a local polynomial model should not be used for <math>\mu(x)</math> directly. Instead, the logistic transformation
<math display="block">
\theta(x) = \log \left ( \frac{\mu(x)}{1-\mu(x)} \right )
</math>
can be used; equivalently,
<math display="block">
\begin{align}
1-\mu(x) &= \frac{1}{1+e^{\theta(x)}} ;\\
\mu(x) &= \frac{e^{\theta(x)}}{1+e^{\theta(x)}}
\end{align}
</math>
and the mass function is
<math display="block">
f(Y_i,\theta(x_i)) = \frac{ e^{Y_i \theta(x_i)}}{1+e^{\theta(x_i)}}.
</math>
 
An asymptotic theory for local likelihood estimation is developed in J. Fan, [[Nancy E. Heckman]] and M. P. Wand (1995);<ref>{{cite Q|Q132508409}}</ref> the book Loader (1999)<ref name="loabook">{{cite Q|Q59410587}}</ref> discusses many more applications of local likelihood.
 
====Robust local regression====
 
To address the sensitivity to outliers, techniques from [[robust regression]] can be employed. In local [[M-estimator|M-estimation]], the local least-squares criterion is replaced by a criterion of the form
<math display="block">
\sum_{i=1}^n w_i(x) \rho \left ( \frac{Y_i - \beta_0 - \beta_1(x_i-x) - \ldots - \beta_p(x_i-x)^p}{s}
\right )
</math>
where <math>\rho(\cdot)</math> is a robustness function and <math>s</math> is a scale parameter. Discussion of the merits of different choices of robustness function is best left to the [[robust regression]] literature. The scale parameter <math>s</math> must also be estimated. References for local M-estimation include Katkovnik (1985)<ref name="katbook" /> and [[Alexandre Tsybakov]] (1986).<ref>{{citation|first=Alexandre B.|last=Tsybakov|title=Robust reconstruction of functions by the local-approximation method|journal=Problems of Information Transmission|volume=22|date=1986|pages=133–146}}</ref>
 
The robustness iterations in LOWESS and LOESS correspond to a bisquare robustness function; outlying observations are progressively downweighted at each iteration. Alternatively, taking the absolute-value loss <math>\rho(u) = |u|</math> gives the local <math>L_1</math> criterion
<math display="block">
\sum_{i=1}^n w_i(x) \left | Y_i - \beta_0 - \ldots - \beta_p(x_i-x)^p \right |;
</math>
this does not require a scale parameter. When <math>p=0</math>, this criterion is minimized by a locally weighted median; local <math>L_1</math> regression can be interpreted as estimating the ''median'', rather than ''mean'', response. If the loss function is skewed, this becomes local quantile regression. See [[Keming Yu]] and [[M. C. Jones (mathematician)|M. C. Jones]] (1998).<ref>{{citation|first1=Keming|last1=Yu|first2=M. C.|last2=Jones|title=Local Linear Quantile Regression|journal=Journal of the American Statistical Association|date=1998|volume=93|issue=441|pages=228–237|doi=10.1080/01621459.1998.10474104}}</ref>
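In base R, these robustness iterations are exposed through the <code>iter</code> argument of <code>lowess()</code>; a brief sketch with artificially injected outliers:
<syntaxhighlight lang="r">
# iter = 0 gives a plain local linear fit; iter = 3 (the default)
# downweights large residuals with bisquare weights at each pass.
set.seed(2)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.2)
y[sample(100, 5)] <- 5                      # gross outliers

plain  <- lowess(x, y, f = 1/3, iter = 0)   # distorted near the outliers
robust <- lowess(x, y, f = 1/3, iter = 3)   # largely resistant to them
</syntaxhighlight>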
 
==Advantages==

==Disadvantages==
 
Finally, as discussed above, LOESS is a computationally intensive method (with the exception of evenly spaced data, where the regression can then be phrased as a non-causal [[finite impulse response]] filter). LOESS is also prone to the effects of outliers in the data set, like other least squares methods. There is an iterative, [[robust statistics|robust]] version of LOESS [Cleveland (1979)] that can be used to reduce LOESS' sensitivity to [[outliers]], but too many extreme outliers can still overcome even the robust method.
 
==Further reading==
 
Books substantially covering local regression and extensions:
* Macaulay (1931) "The Smoothing of Time Series",<ref name="mac1931" /> discusses graduation methods with several chapters related to local polynomial fitting.
* Katkovnik (1985) "Nonparametric Identification and Smoothing of Data"<ref name="katbook" /> (in Russian).
* Fan and Gijbels (1996) "Local Polynomial Modelling and Its Applications".<ref>{{citeQ|Q134377589}}</ref>
* Loader (1999) "Local Regression and Likelihood".<ref name="loabook" />
* Fotheringham, Brunsdon and Charlton (2002), "Geographically Weighted Regression"<ref name="gwrbook">{{citeQ|Q133002722}}</ref> (a development of local regression for spatial data).
 
Book chapters and reviews:
* "Smoothing by Local Regression: Principles and Methods"<ref name="slrpm">{{citeQ|Q132138257}}</ref>
* "Local Regression and Likelihood", Chapter 13 of ''Observed Brain Dynamics'', Mitra and Bokil (2007)<ref>{{citeQ|Q57575432}}</ref>
* [[Rafael Irizarry (scientist)|Rafael Irizarry]], "Local Regression". Chapter 3 of "Applied Nonparametric and Modern Statistics".<ref>{{cite web|last=Irizarry|first=Rafael|title=Applied Nonparametric and Modern Statistics|url=https://rafalab.dfci.harvard.edu/pages/754/|access-date=2025-05-16}}</ref>
 
==See also==
 
==External links==
{{external links|date=November 2021}}
*[http://voteforamerica.net/editorials/Comments.aspx?ArticleId=28&ArticleName=Electoral+Projections+Using+LOESS Local Regression and Election Modeling]
*[http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm NIST Engineering Statistics Handbook Section on LOESS]
*[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html Scatter Plot Smoothing]
*[https://stat.ethz.ch/R-manual/R-devel/library/stats/html/loess.html R: Local Polynomial Regression Fitting] The Loess function in [[R (programming language)|R]]
*[https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lowess.html R: Scatter Plot Smoothing] The Lowess function in [[R (programming language)|R]]
*[https://stat.ethz.ch/R-manual/R-devel/library/stats/html/supsmu.html The supsmu function] (Friedman's SuperSmoother) in R
*[http://www.r-statistics.com/2010/04/quantile-loess-combining-a-moving-quantile-window-with-loess-r-function/ Quantile LOESS] – A method to perform Local regression on a '''Quantile''' moving window (with R code)
*[http://fivethirtyeight.blogs.nytimes.com/2013/03/26/how-opinion-on-same-sex-marriage-is-changing-and-what-it-means/?hp Nate Silver, How Opinion on Same-Sex Marriage Is Changing, and What It Means] – sample of LOESS versus linear regression