In this framing, when <math>x_i</math> is not actually a [[random variable]], what type of parameter does the empirical correlation <math>r_{xy}</math> estimate? The issue is that for each value <math>i</math> we have <math>E(x_i)=x_i</math> and <math>Var(x_i)=0</math>. A possible interpretation of <math>r_{xy}</math> is to imagine that <math>x_i</math> defines a random variable drawn from the [[Empirical distribution function|empirical distribution]] of the x values in our sample. For example, if x had 10 values from the [[natural numbers]], [1, 2, 3, ..., 10], then we can imagine x to follow a [[discrete uniform distribution]] over those values. Under this interpretation all <math>x_i</math> have the same expectation and some positive variance. With this interpretation we can think of <math>r_{xy}</math> as an estimator of the [[Pearson_correlation_coefficient#Definition|Pearson's correlation]] between the random variable y and the random variable x (as we just defined it).
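As a minimal numerical sketch of this interpretation (the data below are hypothetical, chosen only for illustration), treating the observed x values as draws from their empirical distribution makes the sample correlation the plug-in estimate of Pearson's correlation, which is also what general-purpose routines compute:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample: x takes the values 1..10, y is an arbitrary paired response.
x = np.arange(1, 11, dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8, 10.2, 10.9])

# Sample (empirical) correlation r_xy from the usual plug-in formula.
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# The same value from NumPy's built-in Pearson correlation.
print(r_xy, np.corrcoef(x, y)[0, 1])
</syntaxhighlight>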
==Numerical properties==
: <math>\widehat{r} = \frac{nS_{xy} - S_xS_y}{\sqrt{(nS_{xx} - S_x^2)(nS_{yy} - S_y^2)}} = 0.9946</math>
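As a sketch of this computation (the data below are hypothetical and not the article's example, so the resulting value differs from 0.9946), the correlation coefficient can be evaluated directly from the running sums in the formula above:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Running sums used by the closed-form expression.
S_x, S_y = x.sum(), y.sum()
S_xx, S_yy = (x ** 2).sum(), (y ** 2).sum()
S_xy = (x * y).sum()

r = (n * S_xy - S_x * S_y) / np.sqrt((n * S_xx - S_x ** 2) * (n * S_yy - S_y ** 2))
print(r)  # agrees with np.corrcoef(x, y)[0, 1]
</syntaxhighlight>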
==Alternatives==
[[File:Fitting a straight line to a data with outliers.png|thumb|Calculating the parameters of a linear model by minimizing the squared error can lead to a model that attempts to fit the outliers more than the data.]]
{{excerpt|Line fitting}}
===Simple linear regression without the intercept term (single regressor)===
Sometimes it is appropriate to force the regression line to pass through the origin, because {{mvar|x}} and {{mvar|y}} are assumed to be proportional. For the model without the intercept term, {{math|''y'' {{=}} ''βx''}}, the OLS estimator for {{mvar|β}} simplifies to
: <math>\widehat{\beta} = \frac{ \sum_{i=1}^n x_i y_i }{ \sum_{i=1}^n x_i^2 } = \frac{\overline{x y}}{\overline{x^2}} </math>
Substituting {{math|(''x'' − ''h'', ''y'' − ''k'')}} in place of {{math|(''x'', ''y'')}} gives the regression through {{math|(''h'', ''k'')}}:
: <math>\begin{align}
 \widehat\beta &= \frac{ \sum_{i=1}^n (x_i - h) (y_i - k) }{ \sum_{i=1}^n (x_i - h)^2 } = \frac{\overline{(x - h) (y - k)}}{\overline{(x - h)^2}} \\[6pt]
 &= \frac{\overline{x y} - k \bar{x} - h \bar{y} + h k }{\overline{x^2} - 2 h \bar{x} + h^2} \\[6pt]
 &= \frac{\overline{x y} - \bar{x} \bar{y} + (\bar{x} - h)(\bar{y} - k)}{\overline{x^2} - \bar{x}^2 + (\bar{x} - h)^2} \\[6pt]
 &= \frac{\operatorname{Cov}(x,y) + (\bar{x} - h)(\bar{y}-k)}{\operatorname{Var}(x) + (\bar{x} - h)^2},
\end{align}</math>
where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias).
The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.
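As an illustrative sketch under hypothetical data (the arrays and the point {{math|(''h'', ''k'')}} below are made up for demonstration), both the no-intercept estimator and its shifted version through a chosen point follow directly from the formulas above:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data, roughly proportional, so a through-the-origin fit is sensible.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS slope with no intercept: beta = sum(x*y) / sum(x^2).
beta_origin = np.sum(x * y) / np.sum(x ** 2)

# Forcing the line through an arbitrary point (h, k) instead of the origin:
# shift the data by (h, k) and apply the same formula.
h, k = 1.0, 2.0
beta_hk = np.sum((x - h) * (y - k)) / np.sum((x - h) ** 2)

print(beta_origin, beta_hk)
</syntaxhighlight>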
==See also==
* [[Design matrix#Simple linear regression]]
* [[Line fitting]]
* [[Linear trend estimation]]
* [[Segmented regression|Linear segmented regression]]