To formalize this assertion we must define a framework in which these estimators are random variables. We consider the residuals {{math|''ε''<sub>i</sub>}} as random variables drawn independently from some distribution with mean zero. In other words, for each value of {{mvar|x}}, the corresponding value of {{mvar|y}} is generated as a mean response {{math|''α'' + ''βx''}} plus an additional random variable {{mvar|ε}} called the ''error term'', equal to zero on average. Under this interpretation, the least-squares estimators <math>\widehat\alpha</math> and <math>\widehat\beta</math> are themselves random variables whose means equal the "true values" {{mvar|α}} and {{mvar|β}}. This is the definition of an unbiased estimator.
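Unbiasedness can be illustrated by simulation. In the following sketch, the "true" parameter values, the design points, and the error distribution are all illustrative assumptions, not taken from the article; averaging the least-squares estimates over many simulated samples recovers the assumed {{mvar|α}} and {{mvar|β}}.

```python
# Monte Carlo sketch of unbiasedness under assumed parameters.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 2.0, 0.5      # assumed "true" values (hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # fixed design points

a_hats, b_hats = [], []
for _ in range(20000):
    # Generate y as mean response alpha + beta*x plus a zero-mean error term.
    y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)
    # Least-squares estimates of slope and intercept.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    a_hats.append(a)
    b_hats.append(b)

# The sample averages of the estimators approach the true parameters.
print(np.mean(a_hats), np.mean(b_hats))
```

With 20,000 replications the averages of <math>\widehat\alpha</math> and <math>\widehat\beta</math> land very close to the assumed 1.0 and 2.0, as unbiasedness predicts.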
 
=== Variance of the mean response ===
{{further|Variance of the mean and predicted responses}}
 
Since the data in this context is defined to be (''x'', ''y'') pairs for every observation, the ''mean response'' at a given value of ''x'', say ''x<sub>d</sub>'', is an estimate of the mean of the ''y'' values in the population at the ''x'' value of ''x<sub>d</sub>'', that is <math>\hat{E}(y \mid x_d) \equiv \hat{y}_d</math>. The variance of the mean response is given by
 
: <math>\operatorname{Var}\left(\hat{\alpha} + \hat{\beta}x_d\right) = \operatorname{Var}\left(\hat{\alpha}\right) + \left(\operatorname{Var} \hat{\beta}\right)x_d^2 + 2 x_d \operatorname{Cov} \left(\hat{\alpha}, \hat{\beta} \right) .</math>
 
This expression can be simplified to
 
:<math>\operatorname{Var}\left(\hat{\alpha} + \hat{\beta}x_d\right) =\sigma^2\left(\frac{1}{m} + \frac{\left(x_d - \bar{x}\right)^2}{\sum (x_i - \bar{x})^2}\right),</math>
 
where ''m'' is the number of data points.
 
To demonstrate this simplification, one can make use of the identity
 
: <math>\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac 1 m \left(\sum x_i\right)^2 .</math>
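Both the identity and the simplified variance formula are easy to check numerically. In the sketch below, the data points and the error variance {{math|''σ''<sup>2</sup>}} are illustrative assumptions, not values from the article.

```python
# Numerical check of the sum-of-squares identity and the
# simplified mean-response variance formula.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed data points
sigma2 = 0.25                            # assumed error variance sigma^2
m = len(x)                               # number of data points
x_bar = x.mean()
s_xx = np.sum((x - x_bar) ** 2)

# Identity: sum (x_i - x_bar)^2 = sum x_i^2 - (sum x_i)^2 / m
assert np.isclose(s_xx, np.sum(x ** 2) - x.sum() ** 2 / m)

def var_mean_response(x_d):
    """Var(alpha_hat + beta_hat * x_d) via the simplified formula."""
    return sigma2 * (1.0 / m + (x_d - x_bar) ** 2 / s_xx)

# The variance is smallest at x_d = x_bar and grows as x_d moves away.
print(var_mean_response(x_bar), var_mean_response(5.0))
```

Note that the variance is minimized when {{math|''x<sub>d</sub>'' {{=}} ''x̄''}}, where the quadratic term vanishes.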
 
=== Variance of the predicted response ===
The ''predicted response'' distribution is the distribution of a new observation ''y<sub>d</sub>'' at the given point ''x<sub>d</sub>''. The variance of the prediction error is given by
 
: <math>
\begin{align}
\operatorname{Var}\left(y_d - \left[\hat{\alpha} + \hat{\beta} x_d \right] \right) &= \operatorname{Var} (y_d) + \operatorname{Var} \left(\hat{\alpha} + \hat{\beta}x_d\right) - 2\operatorname{Cov}\left(y_d,\left[\hat{\alpha} + \hat{\beta} x_d \right]\right)\\
&= \operatorname{Var} (y_d) + \operatorname{Var} \left(\hat{\alpha} + \hat{\beta}x_d\right).
\end{align}
</math>
 
The second line follows from the fact that <math>\operatorname{Cov}\left(y_d,\left[\hat{\alpha} + \hat{\beta} x_d \right]\right)</math> is zero because the new prediction point is independent of the data used to fit the model. Additionally, the term <math>\operatorname{Var} \left(\hat{\alpha} + \hat{\beta}x_d\right)</math> was calculated earlier for the mean response.
 
Since <math>\operatorname{Var}(y_d)=\sigma^2</math> (a fixed but unknown parameter that can be estimated), the variance of the predicted response is given by
 
: <math>
\begin{align}
\operatorname{Var}\left(y_d - \left[\hat{\alpha} + \hat{\beta} x_d \right] \right) & = \sigma^2 + \sigma^2\left(\frac 1 m + \frac{\left(x_d - \bar{x}\right)^2}{\sum (x_i - \bar{x})^2}\right)\\[4pt]
& = \sigma^2\left(1 + \frac 1 m + \frac{(x_d - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right).
\end{align}
</math>
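The relationship between the two variances can be seen in a short sketch. As before, the data points and {{math|''σ''<sup>2</sup>}} are illustrative assumptions; the prediction variance always exceeds the mean-response variance by exactly {{math|''σ''<sup>2</sup>}}, the irreducible variance of the new observation itself.

```python
# Contrast of mean-response variance and predicted-response variance
# under assumed data and error variance.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed data points
sigma2 = 0.25                            # assumed error variance sigma^2
m = len(x)
x_bar = x.mean()
s_xx = np.sum((x - x_bar) ** 2)

def var_mean_response(x_d):
    """Variance of the estimated mean response at x_d."""
    return sigma2 * (1.0 / m + (x_d - x_bar) ** 2 / s_xx)

def var_predicted_response(x_d):
    """Variance of the prediction error for a new observation at x_d.

    Adds sigma^2 for the new observation's own error term.
    """
    return sigma2 * (1.0 + 1.0 / m + (x_d - x_bar) ** 2 / s_xx)

for x_d in (1.0, 3.0, 5.0):
    print(x_d, var_mean_response(x_d), var_predicted_response(x_d))
```
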
 
===Confidence intervals===
The formulas given in the previous section allow one to calculate the ''point estimates'' of {{mvar|α}} and {{mvar|β}}, that is, the coefficients of the regression line for the given set of data. However, those formulas do not tell us how precise the estimates are, i.e., how much the estimators <math>\widehat{\alpha}</math> and <math>\widehat{\beta}</math> vary from sample to sample for the specified sample size. [[Confidence interval]]s were devised to address this: an interval constructed so that, were the experiment repeated a very large number of times, it would contain the true parameter value in a specified fraction of the repetitions.
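As a hedged illustration, the sketch below computes an approximate 95% interval for the slope from the residual variance estimate. The data are assumed, and for simplicity a standard normal quantile stands in for the exact Student's ''t'' quantile with {{math|''m'' − 2}} degrees of freedom that a textbook interval would use.

```python
# Approximate 95% confidence interval for the slope (assumed data;
# normal quantile used in place of the exact t quantile).
import numpy as np
from statistics import NormalDist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # assumed data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # assumed data

m = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_xx = np.sum((x - x_bar) ** 2)

beta_hat = np.sum((x - x_bar) * (y - y_bar)) / s_xx  # slope estimate
alpha_hat = y_bar - beta_hat * x_bar                 # intercept estimate

residuals = y - (alpha_hat + beta_hat * x)
s2 = np.sum(residuals ** 2) / (m - 2)    # unbiased estimate of sigma^2

se_beta = np.sqrt(s2 / s_xx)             # standard error of beta_hat
z = NormalDist().inv_cdf(0.975)          # ~1.96; exact CI uses t_{m-2}
ci = (beta_hat - z * se_beta, beta_hat + z * se_beta)
print(beta_hat, ci)
```

A wider interval would result from the exact ''t''<sub>''m''−2</sub> quantile, especially for small samples such as this one.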