Simple linear regression

A '''simple linear regression''' is a [[linear regression]] in which there is only one [[covariate]] (predictor variable). It is the simplest special case of [[multiple regression]].
 
Simple linear regression is used to evaluate the linear relationship between two variables; one example is the relationship between muscle strength and lean body mass. Put another way, simple linear regression develops an equation by which a dependent variable can be predicted or estimated from a given independent variable.
The regression equation is given by
 
<math>Y = a + bX + \varepsilon </math>
 
where <math>Y</math> is the dependent variable, <math>a</math> is the ''y''-intercept, <math>b</math> is the gradient or slope of the line, <math>X</math> is the independent variable, and <math> \varepsilon </math> is a random error term.
The linear relationship between the two variables (i.e. the dependent and the independent variable) can be measured using a correlation coefficient, e.g. the [[Pearson product moment correlation coefficient]].
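For illustration, the following is a minimal Python sketch (standard library only; the data values are made up for the example) that computes the Pearson product moment correlation coefficient for a sample of paired observations:

<syntaxhighlight lang="python">
from math import sqrt

# Made-up sample of paired observations (x = independent, y = dependent)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Building blocks: centered cross-products and sums of squares
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Pearson product moment correlation coefficient
r = sxy / sqrt(sxx * syy)
print(r)  # close to 1 here, indicating a strong positive linear relationship
</syntaxhighlight>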
== Estimating the Regression Line ==
 
The parameters of the simple linear regression line, <math>Y = a + bX</math>, are normally estimated using the method of [[Ordinary Least Squares]] from a collection of sample data consisting of <math>X</math> values in the scope of the experiment and the corresponding observed <math>Y</math> values. This method finds the line that minimizes the sum of the squares of the regression residuals, <math> \sum_{i=1}^N \hat{\varepsilon}_{i}^2 </math>. Each residual is the difference between an observed value and the value predicted by the line, <math> \hat{\varepsilon}_{i} = y_{i} - \hat{y}_{i} </math>; graphically, it is the vertical distance of a sample data point from the fitted line.
 
The minimization problem can be solved using calculus. Writing <math>e_i</math> for each residual, <math>y_i</math> for each observed value, and <math>\hat{y}_i</math> for the value of <math>Y</math> on the estimated line, the method of least squares minimizes <math>\sum e_i^2 = \sum (y_i - \hat{y}_i)^2</math>. Setting the partial derivatives with respect to <math>a</math> and <math>b</math> to zero yields the following formulas for <math>\hat{a}</math> (the intercept estimate) and <math>\hat{b}</math> (the slope estimate):
 
<math> \hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} </math>
 
<math> \hat{a} = \bar{y} - \hat{b} \bar{x} </math>
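For completeness, here is a sketch of the calculus step (a standard derivation, not spelled out in the original text): substituting <math>\hat{y}_i = a + b x_i</math> and differentiating the sum of squared residuals gives the two normal equations

<math>
\begin{align}
\frac{\partial}{\partial a} \sum_{i=1}^{N} (y_i - a - b x_i)^2 &= -2 \sum_{i=1}^{N} (y_i - a - b x_i) = 0, \\
\frac{\partial}{\partial b} \sum_{i=1}^{N} (y_i - a - b x_i)^2 &= -2 \sum_{i=1}^{N} x_i (y_i - a - b x_i) = 0.
\end{align}
</math>

Solving the first equation for <math>a</math> gives <math>\hat{a} = \bar{y} - \hat{b} \bar{x}</math>; substituting this into the second equation and solving for <math>b</math> gives the formula for <math>\hat{b}</math> above.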
 
The line produced by the method of ordinary least squares above has the following features:
# The line passes through the point <math>(\bar{x}, \bar{y})</math>, where <math>\bar{x}</math> and <math>\bar{y}</math> are the averages of the sample data <math>x_i</math> and <math>y_i</math>
# The sum of the residuals is equal to zero: the positive residuals cancel the negative residuals
# The estimates are unbiased
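As an illustration, the following minimal Python sketch (standard library only; the data values are made up for the example) computes the least-squares estimates with the formulas given earlier and checks the first two features numerically:

<syntaxhighlight lang="python">
# Made-up sample data for the example
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope estimate: b = sum((x_i - x̄)(y_i - ȳ)) / sum((x_i - x̄)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
# Intercept estimate: a = ȳ - b x̄
a = mean_y - b * mean_x

# Residuals e_i = y_i - (a + b x_i)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Feature 1: the fitted line passes through (x̄, ȳ)
assert abs((a + b * mean_x) - mean_y) < 1e-9
# Feature 2: the residuals sum to zero
assert abs(sum(residuals)) < 1e-9
</syntaxhighlight>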
 
== Alternative formulas for the slope coefficient ==
There are alternative (and simpler) formulas for calculating <math> \hat{b} </math>:
 
<math> \hat{b} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = r \frac {s_y}{s_x} </math>
 
Here, <math>r</math> is the correlation coefficient of <math>X</math> and <math>Y</math>, <math>s_x</math> is the sample standard deviation of <math>X</math>, and <math>s_y</math> is the sample standard deviation of <math>Y</math>.
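As a quick numerical check, the following Python sketch (made-up data; <code>statistics.correlation</code> requires Python 3.10 or later) verifies that the sum-based formula and the <math>r \frac{s_y}{s_x}</math> formula give the same slope:

<syntaxhighlight lang="python">
from statistics import correlation, stdev  # correlation: Python 3.10+

# Made-up sample data for the example
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum-based formula: (sum(x_i y_i) - N x̄ ȳ) / (sum(x_i^2) - N x̄^2)
num = sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y
den = sum(xi ** 2 for xi in x) - n * mean_x ** 2
b_sums = num / den

# Correlation-based formula: r * s_y / s_x
b_corr = correlation(x, y) * stdev(y) / stdev(x)

assert abs(b_sums - b_corr) < 1e-9
</syntaxhighlight>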
 
== Inference ==
 
Under the assumption that the error term is normally distributed, the estimate <math>\hat{b}</math> of the slope coefficient is normally distributed with mean equal to <math>b</math> and standard error estimated by:
 
<math> SE_{\hat{b}} = \sqrt { \frac {\sum_{i=1}^N \hat{\varepsilon}_{i}^2 /(N-2)} {\sum_{i=1}^N (x_i - \bar{x})^2} }</math>.
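As a final illustration, here is a minimal Python sketch (standard library only; the same made-up data as above) of computing this estimated standard error:

<syntaxhighlight lang="python">
from math import sqrt

# Made-up sample data for the example
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates (as derived above)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

# Residual sum of squares, divided by N - 2 degrees of freedom
# (two parameters, a and b, were estimated from the data)
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = sqrt((rss / (n - 2)) / sum((xi - mean_x) ** 2 for xi in x))
print(se_b)
</syntaxhighlight>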