===Weight function===
As mentioned above, the weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to conform to the local model have the greatest influence on the local model [[Parameter#Statistics|parameter]] [[Statistical estimation|estimates]], while points that are less likely to conform have less influence.
 
Cleveland (1979)<ref name="cleve79" /> sets out four requirements for the weight function:
# Non-negative: <math>W(x) > 0</math> for <math>|x| < 1</math>.
# Symmetry: <math>W(-x) = W(x)</math>.
# Monotone: <math>W(x)</math> is a nonincreasing function for <math>x \ge 0</math>.
# Bounded support: <math>W(x)=0</math> for <math>|x| \ge 1</math>.
 
Asymptotic efficiency of weight functions has been considered by [[V. A. Epanechnikov]] (1969)<ref>{{citeQ|Q57308723}}</ref> in the context of kernel density estimation; J. Fan (1993)<ref>{{citeQ|Q132691957}}</ref> has derived similar results for local regression. They conclude that the quadratic kernel, <math>W(x) = 1-x^2</math> for <math>|x|\le1</math>, has the greatest efficiency under a mean-squared-error loss function. See [[Kernel (statistics)#Kernel functions in common use|"kernel functions in common use"]] for more discussion of different kernels and their efficiencies.
However, any other weight function that satisfies the properties listed in Cleveland (1979) could also be used. The weight for a specific point in any localized subset of data is obtained by evaluating the weight function at the distance between that point and the point of estimation, after scaling the distance so that the maximum absolute distance over all of the points in the subset of data is exactly one.
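As a concrete illustration of this weighting scheme, the following Python sketch computes such weights for a local subset using the quadratic (Epanechnikov) kernel discussed above. The function names are illustrative only, not taken from any particular LOESS implementation.

<syntaxhighlight lang="python">
import numpy as np

def quadratic_kernel(u):
    """Epanechnikov kernel: W(u) = 1 - u^2 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1, 1.0 - u**2, 0.0)

def local_weights(x_subset, x0, kernel=quadratic_kernel):
    """Weights for the points of a local subset around the point of estimation x0.

    Distances are scaled so that the maximum absolute distance over the
    subset is exactly one, as described above.
    """
    d = np.abs(np.asarray(x_subset) - x0)
    return kernel(d / d.max())
</syntaxhighlight>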
 
Considerations other than MSE are also relevant to the choice of weight function. Smoothness properties of <math>W(x)</math> directly affect smoothness of the estimate <math>\hat\mu(x)</math>. In particular, the quadratic kernel is not differentiable at <math>x=\pm 1</math>, and <math>\hat\mu(x)</math> is not differentiable as a result.
The traditional [[Kernel (statistics)#Kernel functions in common use|tri-cube weight function]],
<math display="block">W(x) = (1 - |x|^3)^3, \qquad |x|<1,</math>
has been used in LOWESS and other local regression software; it combines higher-order differentiability with high MSE efficiency.
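For comparison, a minimal sketch of the tri-cube weight (again with illustrative names): its first and second derivatives vanish at <math>|x| = 1</math>, so the weights taper off smoothly at the edge of the local subset, whereas the quadratic kernel still has non-zero slope there.

<syntaxhighlight lang="python">
import numpy as np

def tricube(u):
    """Tri-cube weight: W(u) = (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1, (1.0 - u**3)**3, 0.0)

# Behaviour near the edge of the support: the tri-cube decays smoothly to 0,
# while the quadratic kernel 1 - u^2 reaches 0 with slope of about -2.
u = np.linspace(0.9, 1.0, 5)
print(tricube(u))
print(1.0 - u**2)
</syntaxhighlight>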
 
One criticism of weight functions with bounded support is that they can lead to numerical problems (i.e. an unstable or singular design matrix) when fitting in regions with sparse data. For this reason, some authors choose to use the Gaussian kernel, or others with unbounded support.

Consider the following generalisation of the linear regression model with a metric <math>w(x,z)</math> on the target space <math>\mathbb R^m</math> that depends on two parameters, <math>x,z\in\mathbb R^p</math>. Assume that the linear hypothesis is based on <math>p</math> input parameters and that, as is customary in these cases, we embed the input space <math>\mathbb R^p</math> into <math>\mathbb R^{p+1}</math> as <math>x\mapsto \hat x := (1,x)</math>, and consider the following ''[[loss function]]''

:<math>\operatorname{RSS}_x(A) = \sum_{i=1}^N(y_i-A\hat x_i)^T w_i(x)(y_i-A\hat x_i).</math>
 
Here, <math>A</math> is an <math>m\times(p+1)</math> real matrix of coefficients, <math>w_i(x):=w(x_i,x)</math> and the subscript ''i'' enumerates input and output vectors from a training set. Since <math>w</math> is a metric, it is a symmetric, positive-definite matrix and, as such, there is another symmetric matrix <math>h</math> such that <math>w=h^2</math>. The above loss function can be rearranged into a trace by observing that
 
:<math>y^Twy = (hy)^T(hy) = \operatorname{Tr}(hyy^Th) = \operatorname{Tr}(wyy^T)</math>.
 
By arranging the vectors <math>y_i</math> and <math>\hat x_i</math> into the columns of an <math>m\times N</math> matrix <math>Y</math> and a <math>(p+1)\times N</math> matrix <math>\hat X</math> respectively, the above loss function can then be written as
 
:<math>\operatorname{Tr}(W(x)(Y-A\hat X)^T(Y-A\hat X))</math>
 
where <math>W(x)</math> is the square diagonal <math>N\times N</math> matrix whose entries are the <math>w_i(x)</math>s. Differentiating with respect to <math>A</math> gives <math>\partial\operatorname{RSS}_x/\partial A = -2\,(Y-A\hat X)\,W(x)\,\hat X^T</math>; setting this equal to zero, one finds the extremal matrix equation
 
:<math>A\hat XW(x)\hat X^T = YW(x)\hat X^T</math>.
 
Assuming further that the square matrix <math>\hat XW(x)\hat X^T</math> is non-singular, the loss function <math>\operatorname{RSS}_x(A)</math> attains its minimum at
 
:<math>A(x) = YW(x)\hat X^T(\hat XW(x)\hat X^T)^{-1}</math>.
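A direct transcription of this closed-form solution into NumPy might look as follows. This is a sketch only; the function and variable names (<code>local_fit</code>, <code>Xhat</code>, and so on) are assumptions made for illustration rather than any established API.

<syntaxhighlight lang="python">
import numpy as np

def local_fit(X, Y, weights):
    """Coefficient matrix A(x) of the locally weighted linear fit.

    X       : (p, N) array, one input vector per column.
    Y       : (m, N) array, one output vector per column.
    weights : (N,) array of weights w_i(x) for the point of estimation x.

    Returns the (m, p+1) matrix A(x) = Y W(x) Xhat^T (Xhat W(x) Xhat^T)^{-1}.
    """
    N = X.shape[1]
    Xhat = np.vstack([np.ones(N), X])   # embedding x -> (1, x)
    W = np.diag(weights)                # diagonal weight matrix W(x)
    M = Xhat @ W @ Xhat.T               # assumed non-singular
    B = Y @ W @ Xhat.T
    # Solve A M = B for A (M is symmetric), rather than forming an explicit inverse.
    return np.linalg.solve(M, B.T).T
</syntaxhighlight>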
 
A typical choice for <math>w(x,z)</math> is the [[Gaussian function|Gaussian weight]]
 
:<math>w(x,z) = \exp\left(-\frac{\| x-z \|^2}{2\alpha^2}\right)</math>.
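Putting the pieces together, a hypothetical end-to-end use of the Gaussian weight with the <code>local_fit</code> sketch above might look like the following, where <math>\alpha</math> acts as a bandwidth; the data and names are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_weights(X, x0, alpha):
    """w(x_i, x0) = exp(-||x_i - x0||^2 / (2 alpha^2)) for each column x_i of X."""
    d2 = np.sum((X - x0[:, None])**2, axis=0)
    return np.exp(-d2 / (2.0 * alpha**2))

# Example with p = 2 inputs, m = 1 output and N = 200 observations,
# reusing local_fit from the sketch above.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2, 200))
Y = np.sin(X[:1]) + 0.1 * rng.normal(size=(1, 200))
x0 = np.array([0.3, -0.2])
A = local_fit(X, Y, gaussian_weights(X, x0, alpha=0.5))
y0 = A @ np.concatenate([[1.0], x0])   # local estimate at x0
</syntaxhighlight>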
 
===Choice of Fitting Criterion===