Simple linear regression

The linear relationship between the two variables (the dependent and the independent variable) can be measured using a correlation coefficient such as the [[Pearson product-moment correlation coefficient]].
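
For concreteness, the coefficient can be computed numerically. A minimal sketch with NumPy (the library choice is an illustration, not part of the article), using the sample points from the numerical example below:

<syntaxhighlight lang="python">
import numpy as np

# Sample points from the numerical example: (1, -1), (2, 4), (6, 3)
x = np.array([1.0, 2.0, 6.0])
y = np.array([-1.0, 4.0, 3.0])

# np.corrcoef returns the 2x2 correlation matrix of x and y;
# the off-diagonal entry is the Pearson coefficient.
r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.5 for these points
</syntaxhighlight>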
 
== Estimating the regression line ==

The parameters of the linear regression line, <math>Y = a + bX</math>, can be estimated using the method of [[ordinary least squares]]. This method finds the line that minimizes the sum of the squares of the regression residuals, <math> \sum_{i=1}^N \hat{\varepsilon}_i^2 </math>. The residual is the difference between the observed value and the predicted value: <math> \hat{\varepsilon}_i = y_i - \hat{y}_i </math>.
 
The minimization problem can be solved using calculus, producing the following formulas for the estimates of the regression parameters:
 
: <math> \hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} </math>
 
: <math> \hat{a} = \bar{y} - \hat{b} \bar{x} </math>
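
These formulas map directly to code. A minimal sketch with NumPy (an illustrative implementation, not part of the article):

<syntaxhighlight lang="python">
import numpy as np

def ols_fit(x, y):
    """Estimate a and b in Y = a + bX by ordinary least squares."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations over sum of squared deviations of x
    b_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through the point of means
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

x = np.array([1.0, 2.0, 6.0])  # data from the numerical example below
y = np.array([-1.0, 4.0, 3.0])
print(ols_fit(x, y))  # a_hat = 0.5, b_hat = 0.5
</syntaxhighlight>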
 
Ordinary least squares produces a fitted line with the following properties:

* The fitted line passes through the point of means, <math>(\bar{x}, \bar{y})</math>.
* The residuals sum to zero: <math> \sum_{i=1}^{N} \hat{\varepsilon}_i = 0 </math>.
* The residuals are uncorrelated with the independent variable: <math> \sum_{i=1}^{N} x_i \hat{\varepsilon}_i = 0 </math>.

There are alternative (and simpler) formulas for calculating <math> \hat{b} </math>:
 
: <math> \hat{b} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = r \frac {s_y}{s_x} </math>
 
Here, ''r'' is the correlation coefficient between X and Y, s<sub>x</sub> is the sample standard deviation of X, and s<sub>y</sub> is the sample standard deviation of Y.
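
The equivalence of the two forms can be checked numerically; a brief sketch (ddof=1 requests the sample standard deviation in NumPy):

<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 2.0, 6.0])
y = np.array([-1.0, 4.0, 3.0])
N = len(x)

# Computational form: sums of products corrected by the means
b_hat = (np.sum(x * y) - N * x.mean() * y.mean()) / (np.sum(x ** 2) - N * x.mean() ** 2)

# Correlation form: r * s_y / s_x with sample standard deviations
r = np.corrcoef(x, y)[0, 1]
b_hat_alt = r * np.std(y, ddof=1) / np.std(x, ddof=1)

print(b_hat, b_hat_alt)  # both equal 0.5 for these points
</syntaxhighlight>
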
Under the assumption that the error term is normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to ''b'' and standard error given by:
 
: <math> s_{\hat{b}} = \sqrt{ \frac {\sum_{i=1}^N \hat{\varepsilon}_i^2 /(N-2)} {\sum_{i=1}^N (x_i - \bar{x})^2} } </math>

A confidence interval for ''b'' can be constructed using a ''t''-distribution with ''N'' &minus; 2 degrees of freedom:
 
: <math> [ \hat{b} - s_{\hat{b}} t_{N-2}^*, \ \hat{b} + s_{\hat{b}} t_{N-2}^* ] </math>
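
Both the standard error and the interval are straightforward to compute; a sketch using SciPy for the ''t'' critical value (an implementation choice made here for illustration):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import t

def slope_confidence_interval(x, y, level=0.95):
    """Standard error of the OLS slope and a t-based confidence interval."""
    N = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    b_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a_hat = y_bar - b_hat * x_bar
    resid = y - (a_hat + b_hat * x)
    # Standard error of the slope estimate (N - 2 degrees of freedom)
    s_b = np.sqrt((np.sum(resid ** 2) / (N - 2)) / np.sum((x - x_bar) ** 2))
    # Two-sided critical value of the t-distribution
    t_star = t.ppf(1 - (1 - level) / 2, df=N - 2)
    return b_hat, s_b, (b_hat - s_b * t_star, b_hat + s_b * t_star)
</syntaxhighlight>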
 
== Numerical example ==
 
Suppose we have the sample of points {(1,-1),(2,4),(6,3)}. The mean of X is 3 and the mean of Y is 2. The slope coefficient estimate is given by:
 
: <math> \hat{b} = \frac {(1 - 3)((-1) - 2) + (2 - 3)(4 - 2) + (6 - 3)(3 - 2)} {(1 - 3)^2 + (2 - 3)^2 + (6 - 3)^2 } = 7/14 = 0.5 </math>
 
The intercept estimate is <math> \hat{a} = \bar{y} - \hat{b}\bar{x} = 2 - 0.5 \times 3 = 0.5 </math>, so the residuals are &minus;2, 2.5, and &minus;0.5, and the sum of squared residuals is 10.5. The standard error of the slope coefficient is therefore <math> \sqrt{ (10.5/1)/14 } = 0.866 </math>. A 95% confidence interval is given by:
 
: [0.5 &minus; 0.866 &times; 12.7062, 0.5 + 0.866 &times; 12.7062] = [&minus;10.504, 11.504].
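
The arithmetic of the example can be verified with a short plain-Python check (the critical value 12.7062 is the one used in the text):

<syntaxhighlight lang="python">
x = [1, 2, 6]
y = [-1, 4, 3]
x_bar, y_bar = 3, 2                        # means given in the text
b_hat, a_hat = 0.5, 0.5                    # slope and intercept estimates
resid = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
ssr = sum(e * e for e in resid)            # sum of squared residuals = 10.5
sxx = sum((xi - x_bar) ** 2 for xi in x)   # = 14
s_b = (ssr / (3 - 2) / sxx) ** 0.5         # = 0.866...
t_star = 12.7062                           # t critical value, 1 degree of freedom
print(b_hat - s_b * t_star, b_hat + s_b * t_star)  # about -10.504 and 11.504
</syntaxhighlight>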
 
[[Category:Regression analysis]]