Simple linear regression

This is an old revision of this page, as edited by Stpasha (talk | contribs) at 21:50, 25 July 2009 (Lead: small introduction + a picture).

In statistics, simple linear regression is the least squares estimator of a linear regression model with a single predictor variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that the sum of squared residuals of the model (that is, the vertical distances between the points of the data set and the fitted line) is as small as possible.

Okun’s law in macroeconomics is an example of simple linear regression. Here the dependent variable (GDP growth) is presumed to be in a linear relationship with the changes in the unemployment rate.

The adjective simple refers to the fact that this regression is one of the simplest in statistics. The slope of the fitted line is equal to the correlation between y and x, corrected by the ratio of the standard deviations of these variables. The intercept of the fitted line is such that it passes through the center of mass $(\bar{x}, \bar{y})$ of the data points.

Estimating the regression line

The parameters of the linear regression model, $y_i = a + b x_i + \varepsilon_i$, can be estimated using the method of ordinary least squares. This method finds the line that minimizes the sum of the squares of the errors, $Q = \sum_{i=1}^n (y_i - a - b x_i)^2$.

The minimization problem can be solved using calculus, producing the following formulas for the estimates of the regression parameters:

$\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

$\hat{a} = \bar{y} - \hat{b} \bar{x}$

As usual, $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ denote the sample means of x and y.
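The estimators above can be sketched in a few lines of Python; the function name `simple_ols` is illustrative, not from the text.

```python
def simple_ols(x, y):
    """OLS estimates (a_hat, b_hat) for the model y = a + b*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # slope: sum of cross-deviations over sum of squared x-deviations
    b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    # intercept: forces the fitted line through the point (x_bar, y_bar)
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat
```

On the sample {(1,−1), (2,4), (6,3)} used later in this article, this returns 0.5 for both the intercept and the slope.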

Ordinary least squares produces the following features:

1. The line goes through the point $(\bar{x}, \bar{y})$. This is easily seen by rearranging the fitted equation $\hat{y} = \hat{a} + \hat{b} x$ as $\hat{y} - \bar{y} = \hat{b} (x - \bar{x})$, which shows that the point $(\bar{x}, \bar{y})$ satisfies the fitted regression equation.

2. The sum of the residuals is equal to zero, if the model includes a constant. To see why, minimize $Q = \sum_{i=1}^n (y_i - a - b x_i)^2$ with respect to a by taking the following partial derivative:

$\frac{\partial Q}{\partial a} = -2 \sum_{i=1}^n (y_i - a - b x_i)$

Setting this partial derivative to zero and noting that the residuals are $\hat{\varepsilon}_i = y_i - \hat{a} - \hat{b} x_i$ yields $\sum_{i=1}^n \hat{\varepsilon}_i = 0$, as desired.

3. The linear combination of the residuals in which the coefficients are the x-values is equal to zero: $\sum_{i=1}^n x_i \hat{\varepsilon}_i = 0$.

4. The estimates are unbiased: $\mathrm{E}[\hat{a}] = a$ and $\mathrm{E}[\hat{b}] = b$.
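Properties 1–3 can be checked numerically. The sketch below fits a line to an arbitrary illustrative sample (the data are not from the text) and computes the three quantities that should vanish or coincide.

```python
# Illustrative sample; the fitting formulas are the OLS estimates from this section.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar
residuals = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]

sum_resid = sum(residuals)                                # property 2: should be ~0
x_weighted = sum(xi * e for xi, e in zip(x, residuals))   # property 3: should be ~0
on_line = a_hat + b_hat * x_bar                           # property 1: should equal y_bar
```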

Alternative formulas for the slope coefficient

There are alternative (and simpler) formulas for calculating $\hat{b}$:

$\hat{b} = r \frac{s_y}{s_x}$

Here, r is the sample correlation coefficient of x and y, $s_x$ is the sample standard deviation of x, and $s_y$ is the sample standard deviation of y.

Inference

Under the assumption that the error term is normally distributed, the estimate of the slope coefficient $\hat{b}$ has a normal distribution with mean equal to b and standard error given by:

$s_{\hat{b}} = \sqrt{ \frac{ \frac{1}{n-2} \sum_{i=1}^n \hat{\varepsilon}_i^2 }{ \sum_{i=1}^n (x_i - \bar{x})^2 } }$

A confidence interval for b can be created using the t-distribution with n − 2 degrees of freedom:

$\left[ \hat{b} - s_{\hat{b}} \, t^*_{n-2}, \ \hat{b} + s_{\hat{b}} \, t^*_{n-2} \right]$

where $t^*_{n-2}$ is the appropriate quantile of the t-distribution with n − 2 degrees of freedom.

Numerical example

Suppose we have the sample of points {(1,−1), (2,4), (6,3)}. The mean of x is 3 and the mean of y is 2. The slope coefficient estimate is given by:

$\hat{b} = \frac{(1-3)(-1-2) + (2-3)(4-2) + (6-3)(3-2)}{(1-3)^2 + (2-3)^2 + (6-3)^2} = \frac{7}{14} = 0.5$

The standard error of the coefficient is 0.866. A 95% confidence interval is given by

[0.5 − 0.866 × 12.7062, 0.5 + 0.866 × 12.7062] = [−10.504, 11.504].
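The whole numerical example can be reproduced with a short script. The t-quantile 12.7062 is taken from the text rather than computed; the variable names are illustrative.

```python
import math

x, y = [1.0, 2.0, 6.0], [-1.0, 4.0, 3.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n               # 3 and 2, as in the text
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx  # 0.5
a_hat = y_bar - b_hat * x_bar
# sum of squared residuals, used in the standard-error formula with n - 2 df
sse = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt((sse / (n - 2)) / s_xx)            # 0.866
t_star = 12.7062  # 97.5% quantile of t with n - 2 = 1 degree of freedom
ci = (b_hat - t_star * se_b, b_hat + t_star * se_b) # (-10.504, 11.504)
```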

Mathematical derivation of the least squares estimates

Assume the stochastic simple regression model $y_i = a + b x_i + \varepsilon_i$, and let $(x_1, y_1), \ldots, (x_n, y_n)$ be a sample of size n. Here the sample values are treated as observable nonrandom variables, but the calculations do not change when the sample is assumed to consist of random variables $(X_i, Y_i)$.

Let Q be the sum of squared errors:

$Q = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - a - b x_i)^2$

Then taking partial derivatives with respect to $a$ and $b$:

$\frac{\partial Q}{\partial a} = -2 \sum_{i=1}^n (y_i - a - b x_i), \qquad \frac{\partial Q}{\partial b} = -2 \sum_{i=1}^n x_i (y_i - a - b x_i)$

Setting $\frac{\partial Q}{\partial a}$ and $\frac{\partial Q}{\partial b}$ to zero yields

$n \hat{a} + \hat{b} \sum_{i=1}^n x_i = \sum_{i=1}^n y_i, \qquad \hat{a} \sum_{i=1}^n x_i + \hat{b} \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i$

which are known as the normal equations and can be written in matrix notation as

$\begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix}$

Using Cramer's rule we get

$\hat{a} = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad \hat{b} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$

Dividing the numerator and denominator of the last expression by n:

$\hat{b} = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}$

Isolating $\hat{a}$ from the first normal equation yields

$\hat{a} = \bar{y} - \hat{b} \bar{x}$

which is a common formula for $\hat{a}$ in terms of $\hat{b}$ and the sample means.
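The Cramer's-rule solution can be checked numerically against the first normal equation. The sketch below uses the sample from the numerical example; the sum names (`s_x`, `s_xx`, etc.) are illustrative.

```python
x, y = [1.0, 2.0, 6.0], [-1.0, 4.0, 3.0]
n = len(x)
s_x, s_y = sum(x), sum(y)
s_xx = sum(xi * xi for xi in x)
s_xy = sum(xi * yi for xi, yi in zip(x, y))

# Cramer's rule applied to [[n, s_x], [s_x, s_xx]] [a, b]^T = [s_y, s_xy]^T
det = n * s_xx - s_x * s_x
b_hat = (n * s_xy - s_x * s_y) / det
a_hat = (s_y * s_xx - s_x * s_xy) / det

# The first normal equation isolates a_hat as y_bar - b_hat * x_bar
a_check = s_y / n - b_hat * s_x / n
```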

$\hat{b}$ may also be written as

$\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

using the following equalities:

$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}, \qquad \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n \bar{x}^2$

The following calculation shows that $(\hat{a}, \hat{b})$ is a minimum. The second-order partial derivatives are

$\frac{\partial^2 Q}{\partial a^2} = 2n, \qquad \frac{\partial^2 Q}{\partial b^2} = 2 \sum_{i=1}^n x_i^2, \qquad \frac{\partial^2 Q}{\partial a \, \partial b} = 2 \sum_{i=1}^n x_i$

Hence the Hessian matrix of Q is given by

$H = 2 \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}$

Since $2n > 0$ and $\det H = 4 \left( n \sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2 \right) > 0$ by the Cauchy–Schwarz inequality (provided the $x_i$ are not all equal), $H$ is positive definite for all $(a, b)$, and $(\hat{a}, \hat{b})$ is a minimum.
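The positive-definiteness argument can be verified via the leading principal minors (Sylvester's criterion), here on the sample from the numerical example.

```python
x = [1.0, 2.0, 6.0]
n = len(x)
s_x = sum(x)
s_xx = sum(xi * xi for xi in x)

# Hessian H = 2 * [[n, s_x], [s_x, s_xx]]
H = [[2 * n, 2 * s_x], [2 * s_x, 2 * s_xx]]

minor_1 = H[0][0]                           # 2n, positive for any sample
minor_2 = H[0][0] * H[1][1] - H[0][1] ** 2  # 4(n*s_xx - s_x^2), positive unless all x_i equal
```

Both minors are positive, so H is positive definite by Sylvester's criterion.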

The covariance matrix of the estimates is proportional to the inverse of the Hessian matrix: $\operatorname{Var}(\hat{a}, \hat{b}) = 2 \sigma^2 H^{-1}$, where $\sigma^2$ is the variance of the error term.