Simple linear regression
 
In this framing, when <math>x_i</math> is not actually a [[random variable]], what type of parameter does the empirical correlation <math>r_{xy}</math> estimate? The issue is that for each value <math>i</math> we have <math>E(x_i)=x_i</math> and <math>\operatorname{Var}(x_i)=0</math>. A possible interpretation of <math>r_{xy}</math> is to imagine that <math>x_i</math> defines a random variable drawn from the [[Empirical distribution function|empirical distribution]] of the ''x'' values in our sample. For example, if ''x'' took the 10 values [1, 2, 3, ..., 10] from the [[natural numbers]], then we can regard ''x'' as a [[Discrete uniform distribution|discrete uniform]] random variable on those values. Under this interpretation all <math>x_i</math> have the same expectation and some positive variance. With this interpretation we can think of <math>r_{xy}</math> as an estimator of the [[Pearson_correlation_coefficient#Definition|Pearson correlation]] between the random variable ''y'' and the random variable ''x'' (as just defined).
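
As an illustration (not part of the original derivation), the following [[Python (programming language)|Python]] sketch, assuming [[NumPy]] and an arbitrary made-up response vector <code>y</code>, checks that the empirical correlation computed from the fixed design points 1, ..., 10 coincides with the sample Pearson correlation of the pairs, which is what the empirical-distribution interpretation above asserts.

<syntaxhighlight lang="python">
import numpy as np

# Fixed design points, viewed as draws from their own empirical
# (discrete uniform) distribution, and an arbitrary illustrative response.
x = np.arange(1, 11, dtype=float)          # 1, 2, ..., 10
y = 2.0 * x + np.array([0.3, -0.1, 0.4, -0.2, 0.0,
                        0.1, -0.3, 0.2, -0.4, 0.5])

# Empirical correlation r_xy computed directly from its definition.
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# It agrees with the sample Pearson correlation of the pairs (x_i, y_i).
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
print(r_xy)
</syntaxhighlight>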
 
===Simple linear regression without the intercept term (single regressor) ===
Sometimes it is appropriate to force the regression line to pass through the origin, because {{mvar|x}} and {{mvar|y}} are assumed to be proportional. For the model without the intercept term, {{math|''y'' {{=}} ''βx''}}, the OLS estimator for {{mvar|β}} simplifies to
 
: <math>\widehat{\beta} = \frac{ \sum_{i=1}^n x_i y_i }{ \sum_{i=1}^n x_i^2 } = \frac{\overline{x y}}{\overline{x^2}} </math>
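
A minimal numerical check of this closed form (an added sketch, assuming NumPy and arbitrary made-up data roughly proportional to ''x''):

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data that is roughly proportional to x (no intercept).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS slope for the model y = beta * x.
beta_hat = np.sum(x * y) / np.sum(x ** 2)

# Same slope from a generic least-squares solve with a single regressor
# and no intercept column.
beta_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

assert np.isclose(beta_hat, beta_lstsq[0])
print(beta_hat)
</syntaxhighlight>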
 
Substituting {{math|(''x'' − ''h'', ''y'' − ''k'')}} in place of {{math|(''x'', ''y'')}} gives the regression through {{math|(''h'', ''k'')}}:
 
: <math>\begin{align}
\widehat\beta &= \frac{ \sum_{i=1}^n (x_i - h) (y_i - k) }{ \sum_{i=1}^n (x_i - h)^2 } = \frac{\overline{(x - h) (y - k)}}{\overline{(x - h)^2}} \\[6pt]
&= \frac{\overline{x y} - k \bar{x} - h \bar{y} + h k }{\overline{x^2} - 2 h \bar{x} + h^2} \\[6pt]
&= \frac{\overline{x y} - \bar{x} \bar{y} + (\bar{x} - h)(\bar{y} - k)}{\overline{x^2} - \bar{x}^2 + (\bar{x} - h)^2} \\[6pt]
&= \frac{\operatorname{Cov}(x,y) + (\bar{x} - h)(\bar{y}-k)}{\operatorname{Var}(x) + (\bar{x} - h)^2},
\end{align}</math>
 
where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias).
The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.
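
The identity in the last form can be verified numerically. The sketch below is an added illustration, assuming NumPy, population-style (biased) covariance and variance, and arbitrary made-up data with a chosen point (''h'', ''k'') for the line to pass through.

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary illustrative data and a point (h, k) the line is forced through.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2])
h, k = 0.5, 1.0

# Slope of the least-squares line through (h, k): shift, then fit without intercept.
beta_shifted = np.sum((x - h) * (y - k)) / np.sum((x - h) ** 2)

# Equivalent expression in terms of the biased sample covariance and variance.
cov_xy = np.mean(x * y) - x.mean() * y.mean()
var_x = np.mean(x ** 2) - x.mean() ** 2
beta_moments = (cov_xy + (x.mean() - h) * (y.mean() - k)) / (var_x + (x.mean() - h) ** 2)

assert np.isclose(beta_shifted, beta_moments)
print(beta_shifted)
</syntaxhighlight>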
 
==Numerical properties==
 
: <math>\widehat{r} = \frac{nS_{xy} - S_xS_y}{\sqrt{(nS_{xx} - S_x^2)(nS_{yy} - S_y^2)}} = 0.9946,</math>

where <math>S_x = \textstyle\sum x_i</math>, <math>S_y = \textstyle\sum y_i</math>, <math>S_{xx} = \textstyle\sum x_i^2</math>, <math>S_{yy} = \textstyle\sum y_i^2</math> and <math>S_{xy} = \textstyle\sum x_i y_i</math> are the raw sums over the <math>n</math> data points.
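
A brief check of the sum-based formula (an added sketch assuming NumPy and arbitrary illustrative data, so the printed value will not match the 0.9946 quoted above, which comes from the example data):

<syntaxhighlight lang="python">
import numpy as np

# Arbitrary illustrative data; not the example data behind the quoted 0.9946.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])
n = len(x)

# Raw sums used in the closed-form expression for the correlation coefficient.
Sx, Sy = x.sum(), y.sum()
Sxx, Syy, Sxy = (x * x).sum(), (y * y).sum(), (x * y).sum()

r_hat = (n * Sxy - Sx * Sy) / np.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))

# The sum-based formula agrees with the usual Pearson correlation.
assert np.isclose(r_hat, np.corrcoef(x, y)[0, 1])
print(r_hat)
</syntaxhighlight>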
 
==Alternatives==
[[File:Fitting a straight line to a data with outliers.png|thumb|Calculating the parameters of a linear model by minimizing the squared error can lead to a model that attempts to fit the outliers more than the data.]]
 
=== [[Line fitting]] ===
{{excerpt|Line fitting}}
 
==See also==
* [[Design matrix#Simple linear regression]]
* [[Line fitting]]
* [[Linear trend estimation]]
* [[Segmented regression|Linear segmented regression]]