{{Short description|Concept in statistics}}
{{For|the linear transformation|Projection (linear algebra)}}
In [[statistics]], the '''projection matrix''' (<math>\mathbf{P}</math>),<ref>{{cite book |first=Alexander |last=Basilevsky |title=Applied Matrix Algebra in the Statistical Sciences |publisher=Dover |year=2005 |isbn=0-486-44538-0 |pages=160–176 |url=https://books.google.com/books?id=ScssAwAAQBAJ&pg=PA160 }}</ref> sometimes also called the '''influence matrix'''<ref>{{cite web |title=Data Assimilation: Observation influence diagnostic of a data assimilation system |url=http://old.ecmwf.int/newsevents/training/lecture_notes/pdf_files/ASSIM/ObservationInfluence.pdf |archive-url=https://web.archive.org/web/20140903115021/http://old.ecmwf.int/newsevents/training/lecture_notes/pdf_files/ASSIM/ObservationInfluence.pdf |url-status=dead |archive-date=2014-09-03 }}</ref> or '''hat matrix''' (<math>\mathbf{H}</math>), maps the vector of [[response variable|response values]] (dependent variable values) to the vector of [[fitted value]]s (or predicted values). It describes the [[influence function (statistics)|influence]] each response value has on each fitted value.<ref name="Hoaglin1977">{{Cite journal | title = The Hat Matrix in Regression and ANOVA | first1= David C. | last1= Hoaglin |first2= Roy E. | last2=Welsch |journal= [[The American Statistician]] | volume=32 |date=February 1978 | pages=17–22 | doi = 10.2307/2683469 |issue=1 | jstor = 2683469 |url=http://dspace.mit.edu/bitstream/1721.1/1920/1/SWP-0901-02752210.pdf | hdl= 1721.1/1920 | hdl-access= free }}</ref><ref name="Freedman09">{{cite book |author=David A. Freedman |author-link=David A. Freedman |year=2009 |title=Statistical Models: Theory and Practice |publisher=[[Cambridge University Press]] }}</ref> The diagonal elements of the projection matrix are the [[leverage (statistics)|leverage]]s, which describe the influence each response value has on the fitted value for that same observation.
 
==Definition==
If the vector of [[Response variable|response values]] is denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\mathbf{\hat{y}}</math>,
:<math>\mathbf{\hat{y}} = \mathbf{P} \mathbf{y}.</math>
As <math>\mathbf{\hat{y}}</math> is usually pronounced "y-hat", the projection matrix <math>\mathbf{P}</math> is also named ''hat matrix'' as it "puts a [[circumflex|hat]] on <math>\mathbf{y}</math>".

==Application for residuals==
The formula for the vector of [[errors and residuals in statistics|residual]]s <math>\mathbf{r}</math> can also be expressed compactly using the projection matrix:
:<math>\mathbf{r} = \mathbf{y} - \mathbf{\hat{y}} = \mathbf{y} - \mathbf{P} \mathbf{y} = \left( \mathbf{I} - \mathbf{P} \right) \mathbf{y},</math>
where <math>\mathbf{I}</math> is the [[identity matrix]]. The matrix <math>\mathbf{M} := \mathbf{I} - \mathbf{P}</math> is sometimes referred to as the '''residual maker matrix''' or the '''annihilator matrix'''. Moreover, when the errors are uncorrelated and have equal variance, the element in the ''i''th row and ''j''th column of <math>\mathbf{P}</math> is equal to the [[covariance]] between the ''j''th response value and the ''i''th fitted value, divided by the [[variance]] of the former:
:<math>p_{ij} = \operatorname{Cov}\left[ \hat{y}_i, y_j \right] / \operatorname{Var}\left[ y_j \right].</math>

The [[covariance matrix]] of the residuals <math>\mathbf{r}</math>, by [[error propagation]], equals
:<math>\mathbf{\Sigma}_\mathbf{r} = \left( \mathbf{I}-\mathbf{P} \right)^\textsf{T} \mathbf{\Sigma} \left( \mathbf{I}-\mathbf{P} \right),</math>
where <math>\mathbf{\Sigma}</math> is the [[covariance matrix]] of the error vector (and by extension, the response vector as well). For the case of linear models with [[independent and identically distributed]] errors in which <math>\mathbf{\Sigma} = \sigma^{2} \mathbf{I}</math>, this reduces to:<ref name="Hoaglin1977"/>
:<math>\mathbf{\Sigma}_\mathbf{r} = \left( \mathbf{I} - \mathbf{P} \right) \sigma^{2}.</math>
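
For illustration, the following [[NumPy]] sketch builds a projection matrix from a small, hypothetical design matrix (using the explicit formula given in the [[#Linear model|linear model]] section below), forms the residual maker matrix, and verifies the identity <math>\mathbf{\Sigma}_\mathbf{r} = \left( \mathbf{I} - \mathbf{P} \right) \sigma^{2}</math> for i.i.d. errors. The data and dimensions are arbitrary choices made for the example, not taken from any particular source.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical design matrix: n = 5 observations, intercept plus one regressor
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
n = X.shape[0]

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection (hat) matrix
M = np.eye(n) - P                      # residual maker / annihilator matrix

y = np.array([1.0, 2.0, 1.5, 3.5, 4.0])  # hypothetical response values
y_hat = P @ y                            # fitted values
r = M @ y                                # residuals, r = (I - P) y
print(np.allclose(r, y - y_hat))         # True

# For i.i.d. errors with variance sigma^2: Cov(r) = (I - P)^T Sigma (I - P) = sigma^2 (I - P)
sigma2 = 2.0
Sigma = sigma2 * np.eye(n)
print(np.allclose(M.T @ Sigma @ M, sigma2 * M))  # True, since M is symmetric and idempotent
</syntaxhighlight>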
 
==Intuition==
[[File:Projection of a vector onto the column space of a matrix.svg|thumb|A matrix, <math>\mathbf{A}</math>, has its column space depicted as the green line. The projection of some vector <math>\mathbf{b}</math> onto the column space of <math>\mathbf{A}</math> is the vector <math>\mathbf{x}</math>]]
 
From the figure, it is clear that the closest point from the vector <math>\mathbf{b}</math> to the column space of <math>\mathbf{A}</math> is <math>\mathbf{Ax}</math>, the point at which the line from <math>\mathbf{b}</math> meets the column space at a right angle. A vector that is orthogonal to the column space of a matrix is in the nullspace of the matrix transpose, so
:<math>\mathbf{A}^\textsf{T}(\mathbf{b}-\mathbf{Ax}) = 0</math>.
 
Rearranging gives
:<math>\begin{align}
&& \mathbf{A}^\textsf{T}\mathbf{b} &- \mathbf{A}^\textsf{T}\mathbf{Ax} = 0 \\
\Rightarrow && \mathbf{A}^\textsf{T}\mathbf{b} &= \mathbf{A}^\textsf{T}\mathbf{Ax} \\
\Rightarrow && \mathbf{x} &= \left(\mathbf{A}^\textsf{T}\mathbf{A}\right)^{-1}\mathbf{A}^\textsf{T}\mathbf{b}
\end{align}</math>.
 
Therefore, since <math>\mathbf{Ax}</math> is in the column space of <math>\mathbf{A}</math>, the projection matrix, which maps <math>\mathbf{b}</math> onto its projection <math>\mathbf{Ax}</math>, is <math>\mathbf{A}\left(\mathbf{A}^\textsf{T}\mathbf{A}\right)^{-1}\mathbf{A}^\textsf{T}</math>.
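
The same construction can be checked numerically. In the sketch below, the matrix <math>\mathbf{A}</math> and vector <math>\mathbf{b}</math> are arbitrary values chosen only for illustration; the residual <math>\mathbf{b} - \mathbf{Ax}</math> is orthogonal to the columns of <math>\mathbf{A}</math>, and multiplying <math>\mathbf{b}</math> by <math>\mathbf{A}\left(\mathbf{A}^\textsf{T}\mathbf{A}\right)^{-1}\mathbf{A}^\textsf{T}</math> reproduces the projection <math>\mathbf{Ax}</math>.
<syntaxhighlight lang="python">
import numpy as np

# Arbitrary matrix with linearly independent columns, and a vector to project
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Solve the normal equations  A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual b - Ax is orthogonal to the column space of A
print(np.allclose(A.T @ (b - A @ x), 0.0))   # True

# The projection matrix maps b to the projected point Ax
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ b, A @ x))             # True
</syntaxhighlight>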
 
== Linear model ==
Suppose that we wish to estimate a linear model using linear least squares. The model can be written as
:<math>\mathbf{y} = \mathbf{X} \boldsymbol \beta + \boldsymbol \varepsilon,</math>
where <math>\mathbf{X}</math> is a matrix of [[explanatory variable]]s (the [[design matrix]]), '''''β''''' is a vector of unknown parameters to be estimated, and '''''ε''''' is the error vector.
 
Many types of models and techniques are subject to this formulation. A few examples are [[linear least squares (mathematics)|linear least squares]], [[smoothing splines]], [[regression splines]], [[local regression]], [[kernel regression]], and [[linear filter]]ing.
=== Ordinary least squares ===
When the weights for each observation are identical and the [[errors and residuals in statistics|errors]] are uncorrelated, the estimated parameters are
 
:<math>\hat{\boldsymbol \beta} = \left( \mathbf{X}^{\textsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\textsf{T}} \mathbf{y},</math>
 
so the fitted values are
 
:<math>\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol \beta} = \mathbf{X} \left( \mathbf{X}^{\textsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\textsf{T}} \mathbf{y}.</math>
 
Therefore, the projection matrix (and hat matrix) is given by
 
:<math>\mathbf{P} := \mathbf{X} \left(\mathbf{X}^{\textsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\textsf{T}}.</math>
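
As a concrete illustration, the sketch below computes the hat matrix for a small, randomly generated design matrix (a hypothetical example, not drawn from any particular dataset), checks that it reproduces the fitted values <math>\mathbf{X} \hat{\boldsymbol \beta}</math>, and reads off the leverages from its diagonal. In numerical practice the explicit inverse would normally be avoided in favour of a [[QR decomposition]] or a least-squares solver.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: intercept plus two regressors, n = 6 observations
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = rng.normal(size=6)

# Hat (projection) matrix  P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # ordinary least squares estimate
print(np.allclose(P @ y, X @ beta_hat))          # True: P y gives the fitted values

leverages = np.diag(P)                           # influence of each y_i on its own fitted value
print(np.isclose(leverages.sum(), X.shape[1]))   # True: leverages sum to the number of parameters
</syntaxhighlight>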
 
=== Weighted and generalized least squares ===
{{further|Weighted least squares|Generalized least squares}}
The above may be generalized to the cases where the weights are not identical and/or the errors are correlated. Suppose that the [[covariance matrix]] of the errors is <math>\mathbf{\Sigma}</math>. Then since
 
: <math>
\hat{\mathbf\beta}_{\text{GLS}} = \left( \mathbf{X}^\textsf{T} \mathbf{\Sigma}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^\textsf{T} \mathbf{\Sigma}^{-1} \mathbf{y}
</math>,
 
the hat matrix is thus
 
: <math>
\mathbf{H} = \mathbf{X}\left( \mathbf{X}^\textsf{T} \mathbf{\Sigma}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^\textsf{T} \mathbf{\Sigma}^{-1}
</math>
 
and again it may be seen that <math>\mathbf{H}^2 = \mathbf{H} \cdot \mathbf{H} = \mathbf{H}</math>, though now it is no longer symmetric.
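
This can be verified numerically. In the following sketch, the error covariance <math>\mathbf{\Sigma}</math> is an arbitrary positive-definite matrix chosen for illustration; the resulting generalized hat matrix is idempotent but, unlike in the ordinary least squares case, not symmetric.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

n = 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])  # hypothetical design matrix

# Arbitrary positive-definite error covariance (correlated, heteroscedastic errors)
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)

# Generalized least squares hat matrix  H = X (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1}
H = X @ np.linalg.inv(X.T @ Sigma_inv @ X) @ X.T @ Sigma_inv

print(np.allclose(H @ H, H))   # True: H is idempotent
print(np.allclose(H, H.T))     # False: H is generally not symmetric
</syntaxhighlight>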
 
== Properties ==
The projection matrix has a number of useful algebraic properties.<ref>{{cite book |last=Gans |first=P. |year=1992 |title=Data Fitting in the Chemical Sciences |url=https://archive.org/details/datafittinginche0000gans |url-access=registration |publisher=Wiley |isbn=0-471-93412-7 }}</ref><ref>{{cite book |last=Draper |first=N. R. |last2=Smith |first2=H. |year=1998 |title=Applied Regression Analysis |publisher=Wiley |isbn=0-471-17082-8 }}</ref> In the language of [[linear algebra]], the projection matrix is the [[orthogonal projection]] onto the [[column space]] of the design matrix <math>\mathbf{X}</math>.<ref name = "Freedman09" /> (Note that <math>\left( \mathbf{X}^{\textsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\textsf{T}}</math> is the [[Moore–Penrose pseudoinverse#Full rank|pseudoinverse of X]].) Some facts of the projection matrix in this setting are summarized as follows:<ref name = "Freedman09" />
* <math>\mathbf{r} = (\mathbf{I} - \mathbf{P})\mathbf{y},</math> and <math>\mathbf{r} = \mathbf{y} - \mathbf{P} \mathbf{y} \perp \mathbf{X}.</math>
* <math>\mathbf{P}</math> is symmetric, and so is <math>\mathbf{M} := \mathbf{I} - \mathbf{P}</math>.
* <math>\mathbf{P}</math> is idempotent: <math>\mathbf{P}^2 = \mathbf{P}</math>, and so is <math>\mathbf{M}</math>.
* If <math>\mathbf{X}</math> is an {{nowrap|''n'' × ''r''}} matrix with <math>\operatorname{rank}(\mathbf{X}) = r</math>, then <math>\operatorname{rank}(\mathbf{P}) = r</math>.
* The [[eigenvalue]]s of <math>\mathbf{P}</math> consist of ''r'' ones and {{nowrap|''n'' − ''r''}} zeros, while the eigenvalues of <math>\mathbf{M}</math> consist of {{nowrap|''n'' − ''r''}} ones and ''r'' zeros.<ref>{{cite book |first=Takeshi |last=Amemiya |title=Advanced Econometrics |___location=Cambridge |publisher=Harvard University Press |year=1985 |isbn=0-674-00560-0 |pages=[https://archive.org/details/advancedeconomet00amem/page/460 460]–461 |url=https://archive.org/details/advancedeconomet00amem |url-access=registration }}</ref>
* <math>\mathbf{X}</math> is invariant under <math>\mathbf{P}</math> : <math>\mathbf{P X} = \mathbf{X},</math> hence <math>\left( \mathbf{I} - \mathbf{P} \right) \mathbf{X} = \mathbf{0}</math>.
* <math>\left( \mathbf{I} - \mathbf{P} \right) \mathbf{P} = \mathbf{P} \left( \mathbf{I} - \mathbf{P} \right) = \mathbf{0}.</math>
The projection matrix corresponding to a [[linear model]] is [[symmetric matrix|symmetric]] and [[idempotent matrix|idempotent]], that is, <math>\mathbf{P}^2 = \mathbf{P}</math>. However, this is not always the case; in [[local regression|locally weighted scatterplot smoothing (LOESS)]], for example, the hat matrix is in general neither symmetric nor idempotent.
 
For [[linear models]], the [[trace (linear algebra)|trace]] of the projection matrix is equal to the [[rank (linear algebra)|rank]] of <math>\mathbf{X}</math>, which is the number of independent parameters of the linear model.<ref>{{cite web |title=Proof that trace of 'hat' matrix in linear regression is rank of X |work=Stack Exchange |date=April 13, 2017 |url=https://math.stackexchange.com/q/1582567 }}</ref> For other models such as LOESS that are still linear in the observations <math>\mathbf{y}</math>, the projection matrix can be used to define the [[degrees of freedom (statistics)#Effective degrees of freedom|effective degrees of freedom]] of the model.
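
These algebraic facts are straightforward to check numerically. The sketch below uses an arbitrary full-rank design matrix (a hypothetical example) and confirms that <math>\mathbf{P}</math> is symmetric and idempotent, that its eigenvalues are ones and zeros, and that its trace equals the rank of <math>\mathbf{X}</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

n, r = 8, 3
X = rng.normal(size=(n, r))               # arbitrary full-rank design matrix
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P, P.T))                # True: symmetric
print(np.allclose(P @ P, P))              # True: idempotent

eigvals = np.linalg.eigvalsh(P)           # eigenvalues of the symmetric matrix P
print(np.allclose(np.sort(eigvals), np.concatenate([np.zeros(n - r), np.ones(r)])))  # True

print(np.isclose(np.trace(P), np.linalg.matrix_rank(X)))   # True: trace(P) = rank(X) = r
</syntaxhighlight>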
 
Practical applications of the projection matrix in regression analysis include [[Leverage (statistics)|leverage]] and [[Cook's distance]], which are concerned with identifying [[influential observation]]s, i.e. observations which have a large effect on the results of a regression.
== Blockwise formula ==
 
Suppose the design matrix <math>\mathbf{X}</math> can be decomposed by columns as <math>\mathbf{X} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \end{bmatrix}</math>.
Define the hat or projection operator as <math>\mathbf{P}[\mathbf{X}] := \mathbf{X} \left(\mathbf{X}^\textsf{T} \mathbf{X} \right)^{-1} \mathbf{X}^\textsf{T}</math>. Similarly, define the residual operator as <math>\mathbf{M}[\mathbf{X}] := \mathbf{I} - \mathbf{P}[\mathbf{X}]</math>.
Then the projection matrix can be decomposed as follows:<ref>{{cite book |last1=Rao |first1=C. Radhakrishna |last2=Toutenburg |first2=Helge |author3=Shalabh |first4=Christian |last4=Heumann |title=Linear Models and Generalizations |url=https://archive.org/details/linearmodelsgene00raop |url-access=limited |year=2008 |publisher=Springer |___location=Berlin |isbn=978-3-540-74226-5 |page=[https://archive.org/details/linearmodelsgene00raop/page/n335 323] |edition=3rd}}</ref>
:<math> \mathbf{P}[\mathbf{X}] = \mathbf{P}[\mathbf{A}] + \mathbf{P}\big[\mathbf{M}[\mathbf{A}] \mathbf{B}\big], </math>
where, e.g., <math>\mathbf{P}[\mathbf{A}] = \mathbf{A} \left(\mathbf{A}^\textsf{T} \mathbf{A} \right)^{-1} \mathbf{A}^\textsf{T}</math> and <math>\mathbf{M}[\mathbf{A}] = \mathbf{I} - \mathbf{P}[\mathbf{A}]</math>.

There are a number of applications of such a decomposition. In the classical application <math>\mathbf{A}</math> is a column of all ones, which allows one to analyze the effects of adding an intercept term to a regression. Another use is in the [[fixed effects model]], where <math>\mathbf{A}</math> is a large [[sparse matrix]] of the dummy variables for the fixed effect terms. One can use this partition to compute the hat matrix of <math>\mathbf{X}</math> without explicitly forming the matrix <math>\mathbf{X}</math>, which might be too large to fit into computer memory.
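
The decomposition can be illustrated numerically. In the sketch below, <math>\mathbf{A}</math> is a column of ones (an intercept) and <math>\mathbf{B}</math> contains hypothetical regressors; the identity <math>\mathbf{P}[\mathbf{X}] = \mathbf{P}[\mathbf{A}] + \mathbf{P}\big[\mathbf{M}[\mathbf{A}] \mathbf{B}\big]</math> is then checked directly.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

def proj(Z):
    """Projection (hat) operator  P[Z] = Z (Z^T Z)^{-1} Z^T."""
    return Z @ np.linalg.inv(Z.T @ Z) @ Z.T

n = 7
A = np.ones((n, 1))             # column of ones (intercept term)
B = rng.normal(size=(n, 2))     # hypothetical regressors
X = np.hstack([A, B])

M_A = np.eye(n) - proj(A)       # residual maker of A; here it demeans each column

# Blockwise identity: P[X] = P[A] + P[ M[A] B ]
print(np.allclose(proj(X), proj(A) + proj(M_A @ B)))   # True
</syntaxhighlight>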
==History==
The hat matrix was introduced by [[John Tukey|John Wilder Tukey]] in 1972. An article by Hoaglin, D.C. and Welsch, R.E. (1978) gives the properties of the matrix and also many examples of its application.
 
== See also ==
 
[[Category:Regression analysis]]
[[Category:Matrices (mathematics)]]