Simple linear regression
 
In this framing, when <math>x_i</math> is not actually a [[random variable]], what type of parameter does the empirical correlation <math>r_{xy}</math> estimate? The issue is that for each value <math>i</math> we have <math>E(x_i)=x_i</math> and <math>\operatorname{Var}(x_i)=0</math>. A possible interpretation of <math>r_{xy}</math> is to imagine that <math>x_i</math> defines a random variable drawn from the [[Empirical distribution function|empirical distribution]] of the ''x'' values in our sample. For example, if ''x'' took the 10 values [1, 2, 3, ..., 10] from the [[natural numbers]], then we can regard ''x'' as a [[Discrete uniform distribution|discrete uniform]] random variable on those values. Under this interpretation all <math>x_i</math> have the same expectation and some positive variance. With this interpretation we can think of <math>r_{xy}</math> as an estimator of the [[Pearson_correlation_coefficient#Definition|Pearson correlation]] between the random variable ''y'' and the random variable ''x'' (as just defined).
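
As an illustration (not part of the original derivation), the following [[Python (programming language)|Python]] sketch, assuming [[NumPy]] and an arbitrary made-up response vector <code>y</code>, checks that the empirical correlation computed from the fixed design points 1, ..., 10 coincides with the sample Pearson correlation of the pairs, which is what the empirical-distribution interpretation above asserts.

<syntaxhighlight lang="python">
import numpy as np

# Fixed design points, viewed as draws from their own empirical
# (discrete uniform) distribution, and an arbitrary illustrative response.
x = np.arange(1, 11, dtype=float)          # 1, 2, ..., 10
y = 2.0 * x + np.array([0.3, -0.1, 0.4, -0.2, 0.0,
                        0.1, -0.3, 0.2, -0.4, 0.5])

# Empirical correlation r_xy computed directly from its definition.
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# It agrees with the sample Pearson correlation of the pairs (x_i, y_i).
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
print(r_xy)
</syntaxhighlight>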
 
===Simple linear regression without the intercept term (single regressor) ===
Sometimes it is appropriate to force the regression line to pass through the origin, because {{mvar|x}} and {{mvar|y}} are assumed to be proportional. For the model without the intercept term, {{math|''y'' {{=}} ''βx''}}, the OLS estimator for {{mvar|β}} simplifies to
 
: <math>\widehat{\beta} = \frac{ \sum_{i=1}^n x_i y_i }{ \sum_{i=1}^n x_i^2 } = \frac{\overline{x y}}{\overline{x^2}} </math>
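
A minimal numerical check of this closed form (an added sketch, assuming NumPy and arbitrary made-up data roughly proportional to ''x''):

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data that is roughly proportional to x (no intercept).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS slope for the model y = beta * x.
beta_hat = np.sum(x * y) / np.sum(x ** 2)

# Same slope from a generic least-squares solve with a single regressor
# and no intercept column.
beta_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

assert np.isclose(beta_hat, beta_lstsq[0])
print(beta_hat)
</syntaxhighlight>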
 
Substituting {{math|(''x'' − ''h'', ''y'' − ''k'')}} in place of {{math|(''x'', ''y'')}} gives the regression through {{math|(''h'', ''k'')}}:
 
: <math>\begin{align}
\widehat\beta &= \frac{ \sum_{i=1}^n (x_i - h) (y_i - k) }{ \sum_{i=1}^n (x_i - h)^2 } = \frac{\overline{(x - h) (y - k)}}{\overline{(x - h)^2}} \\[6pt]
&= \frac{\overline{x y} - k \bar{x} - h \bar{y} + h k }{\overline{x^2} - 2 h \bar{x} + h^2} \\[6pt]
&= \frac{\overline{x y} - \bar{x} \bar{y} + (\bar{x} - h)(\bar{y} - k)}{\overline{x^2} - \bar{x}^2 + (\bar{x} - h)^2} \\[6pt]
&= \frac{\operatorname{Cov}(x,y) + (\bar{x} - h)(\bar{y}-k)}{\operatorname{Var}(x) + (\bar{x} - h)^2},
\end{align}</math>
 
where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias).
The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.
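
The identity in the last form can be verified numerically. The sketch below is an added illustration, assuming NumPy, population-style (biased) covariance and variance, and arbitrary made-up data with a chosen point (''h'', ''k'') for the line to pass through.

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary illustrative data and a point (h, k) the line is forced through.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2])
h, k = 0.5, 1.0

# Slope of the least-squares line through (h, k): shift, then fit without intercept.
beta_shifted = np.sum((x - h) * (y - k)) / np.sum((x - h) ** 2)

# Equivalent expression in terms of the biased sample covariance and variance.
cov_xy = np.mean(x * y) - x.mean() * y.mean()
var_x = np.mean(x ** 2) - x.mean() ** 2
beta_moments = (cov_xy + (x.mean() - h) * (y.mean() - k)) / (var_x + (x.mean() - h) ** 2)

assert np.isclose(beta_shifted, beta_moments)
print(beta_shifted)
</syntaxhighlight>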
 
==Numerical properties==
 
: <math>\widehat{r} = \frac{nS_{xy} - S_xS_y}{\sqrt{(nS_{xx} - S_x^2)(nS_{yy} - S_y^2)}} = 0.9946,</math>

where <math>S_x = \textstyle\sum x_i</math>, <math>S_y = \textstyle\sum y_i</math>, <math>S_{xx} = \textstyle\sum x_i^2</math>, <math>S_{yy} = \textstyle\sum y_i^2</math> and <math>S_{xy} = \textstyle\sum x_i y_i</math> are the raw sums over the <math>n</math> data points.
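
A brief check of the sum-based formula (an added sketch assuming NumPy and arbitrary illustrative data, so the printed value will not match the 0.9946 quoted above, which comes from the example data):

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary illustrative data; not the example data behind the quoted 0.9946.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])
n = len(x)

# Raw sums used in the closed-form expression for the correlation coefficient.
Sx, Sy = x.sum(), y.sum()
Sxx, Syy, Sxy = (x * x).sum(), (y * y).sum(), (x * y).sum()

r_hat = (n * Sxy - Sx * Sy) / np.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))

# The sum-based formula agrees with the usual Pearson correlation.
assert np.isclose(r_hat, np.corrcoef(x, y)[0, 1])
print(r_hat)
</syntaxhighlight>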
 
==Alternatives==
[[File:Fitting a straight line to a data with outliers.png|thumb|Calculating the parameters of a linear model by minimizing the squared error can lead to a model that attempts to fit the outliers more than the data.]]
 
=== [[Line fitting]] ===
{{excerpt|Line fitting}}
 
==See also==
* [[Design matrix#Simple linear regression]]
* [[Line fitting]]
* [[Linear trend estimation]]
* [[Segmented regression|Linear segmented regression]]