Simple linear regression

The adjective ''simple'' refers to the fact that this regression is one of the simplest in statistics. The slope of the fitted line is equal to the [[Pearson product moment correlation coefficient|correlation]] between ''y'' and ''x'' corrected by the ratio of standard deviations of these variables. The intercept of the fitted line is such that it passes through the center of mass (<span style="text-decoration:overline">''x''</span>, <span style="text-decoration:overline">''y''</span>) of the data points.
 
== Fitting the regression line ==
Suppose there are ''n'' data points {''y''<sub>''i''</sub>, ''x''<sub>''i''</sub>}, where ''i''&nbsp;=&nbsp;1,&nbsp;2, …, ''n''. The goal is to find the equation of the straight line
 
: <math> y = \alpha + \beta x, \,</math>

which would provide a “best” fit for the data points. Here the “best” will be understood as in the [[Ordinary least squares|least-squares]] approach: such a line that minimizes the sum of squared residuals of the model. In other words, numbers ''α'' and ''β'' solve the following minimization problem:

: <math>Q(\alpha,\beta) = \frac{1}{n} \sum_{i = 1}^n \varepsilon_i^2 = \frac{1}{n}\sum_{i = 1}^n (y_i - \alpha - \beta x_i)^2\ \to\ \min_{\alpha,\,\beta}</math>

(The fraction 1/''n'' in front of the sum does not affect the minimization problem; it is there purely for the convenience of the later discussion of asymptotic results.)
 
Using simple [[calculus]] it can be shown that the values of ''α'' and ''β'' that minimize the objective function ''Q'' are
: <math>\begin{align}
 & \hat\beta = \frac{ \sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y}) }{ \sum_{i=1}^{n} (x_{i} - \bar{x})^2 }
 = \frac{ \overline{xy} - \bar{x}\bar{y} }{ \overline{x^2} - \bar{x}^2 }
 = \frac{ \operatorname{Cov}[x,y] }{ \operatorname{Var}[x] }
 = r_{xy} \frac{s_y}{s_x}, \\
 & \hat\alpha = \bar{y} - \hat\beta\,\bar{x},
\end{align}</math>
 
where ''r<sub>xy</sub>'' is the correlation coefficient between ''x'' and ''y'', ''s<sub>x</sub>'' is the [[standard deviation]] of ''x'', and ''s<sub>y</sub>'' is correspondingly the standard deviation of ''y''. A horizontal bar over a variable denotes the sample average of that variable; for example, <math style="height:1.5em">\overline{xy} = \tfrac{1}{n}\textstyle\sum_{i=1}^n x_iy_i\ .</math>
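The equivalence of the slope formulas above can be checked numerically. The following is an illustrative sketch (not part of the article); the data set and variable names such as `beta_hat` are made up for the example:

```python
# Check that the equivalent formulas for the OLS slope agree
# on a small made-up data set (illustration only).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Formula 1: centred sums
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

# Formula 2: averages of products, (xy_bar - x_bar*y_bar) / (x2_bar - x_bar^2)
xy_bar = sum(xi * yi for xi, yi in zip(x, y)) / n
x2_bar = sum(xi ** 2 for xi in x) / n
beta_hat2 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)

# Formula 3: r_xy * s_y / s_x with sample standard deviations; the
# (n - 1) factors cancel, so this matches the other two exactly.
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / ((n - 1) * s_x * s_y)
beta_hat3 = r_xy * s_y / s_x

print(beta_hat, beta_hat2, beta_hat3, alpha_hat)
```

All three expressions agree because they differ only by cancelling normalization factors.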
 
Sometimes people consider a simple linear regression model without the intercept term: ''y''&nbsp;=&nbsp;''βx''. In such a case the OLS estimator for ''β'' is given by the formula
 
: <span style="color:gray">(in the regression without the intercept):</span> <math>\hat\beta = \frac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^nx_i^2}.</math>
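As a sketch of the no-intercept case (an illustration, not part of the article, with made-up data), the estimator reduces to a ratio of two sums, and one can confirm that it minimizes the sum of squared residuals:

```python
# OLS slope for the through-the-origin model y = beta * x (illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.2, 3.9, 6.1, 7.8]

# beta_hat = sum(x_i * y_i) / sum(x_i^2)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def rss(beta):
    """Sum of squared residuals for the through-the-origin model."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

# The closed-form slope should beat any nearby candidate slope.
assert rss(beta_hat) <= min(rss(beta_hat + d) for d in (-0.1, -0.01, 0.01, 0.1))
print(beta_hat)
```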
=== Properties ===
# The line goes through the “center of mass” point (<i style="text-decoration:overline">x</i>, <i style="text-decoration:overline">y</i>). This is easily seen by rearranging <math>\hat\alpha = \bar{y} - \hat\beta\,\bar{x}</math> as <math>\bar{y} = \hat\alpha + \hat\beta\,\bar{x}</math>, which shows that the point <math>(\bar{x},\bar{y})</math> satisfies the fitted regression equation.
# The sum of the residuals is equal to zero, if the model includes a constant: <math>\textstyle\sum_{i=1}^n\hat\varepsilon_i=0.</math> To see why, take the partial derivative of the objective function with respect to ''α'': <math>\textstyle\frac{\partial}{\partial\alpha} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 = -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i).</math> Setting this to zero and noting that <math>\hat\varepsilon_i = y_i - \hat\alpha - \hat\beta x_i</math> yields the claim.
# The linear combination of the residuals in which the coefficients are the ''x''-values is equal to zero: <math>\textstyle\sum_{i=1}^nx_i\hat\varepsilon_i=0.</math>
# The estimators <math>\hat\alpha</math> and <math>\hat\beta</math> are [[Estimator bias|unbiased]]. This requires that we interpret the model stochastically, that is, we have to assume that for each value of ''x'' the corresponding value of ''y'' is generated as a mean response ''α&nbsp;+&nbsp;βx'' plus an additional random variable ''ε'' called the ''error term''. This error term has to be equal to zero on average, for each value of ''x''. Under such an interpretation the least-squares estimators <math>\hat\alpha</math> and <math>\hat\beta</math> will themselves be random variables, and they will unbiasedly estimate the “true values” ''α'' and ''β''.
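The algebraic properties of the fit above can be verified numerically. The following sketch (an illustration with made-up data, not part of the article) checks that the fitted line passes through the center of mass and that the residuals satisfy the two zero-sum conditions:

```python
# Numerically verify properties of the OLS fit (illustration only):
# the line passes through (x_bar, y_bar), the residuals sum to zero,
# and the residuals are "orthogonal" to the x-values.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.4, 2.9, 4.1, 5.2, 5.9]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]

sum_resid = sum(resid)                                   # property: ~0
sum_x_resid = sum(xi * ri for xi, ri in zip(x, resid))   # property: ~0
fitted_at_x_bar = alpha_hat + beta_hat * x_bar           # property: equals y_bar
print(sum_resid, sum_x_resid, fitted_at_x_bar, y_bar)
```

Both sums are zero up to floating-point rounding, exactly as the first-order conditions of the least-squares problem require.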
 
 
== Inference ==