Multivariate kernel density estimation
The previous figure is a graphical representation of kernel density estimate, which we now define in an exact manner. Let '''x'''<sub>1</sub>, '''x'''<sub>2</sub>, …, '''x'''<sub>''n''</sub> be a [[random sample|sample]] of ''d''-variate [[random vector]]s drawn from a common distribution described by the [[probability density function|density function]] ''ƒ''. The kernel density estimate is defined to be
: <math>
\hat{f}_\mathbf{H}(\mathbf{x})= \frac1n \sum_{i=1}^n K_\mathbf{H} (\mathbf{x} - \mathbf{x}_i)
</math>
where
The most commonly used optimality criterion for selecting a bandwidth matrix is the MISE or [[mean integrated squared error]]
 
: <math>\operatorname{MISE} (\mathbf{H}) = \operatorname{E}\!\left[\, \int (\hat{f}_\mathbf{H} (\mathbf{x}) - f(\mathbf{x}))^2 \, d\mathbf{x} \;\right].</math>
 
This in general does not possess a [[closed-form expression]], so it is usual to use its asymptotic approximation (AMISE) as a proxy
 
: <math>\operatorname{AMISE} (\mathbf{H}) = n^{-1} |\mathbf{H}|^{-1/2} R(K) + \tfrac{1}{4} m_2(K)^2
(\operatorname{vec}^T \mathbf{H}) \boldsymbol{\Psi}_4 (\operatorname{vec} \mathbf{H})</math>
 
where
* <math>R(K) = \int K(\mathbf{x})^2 \, d\mathbf{x}</math>, with {{nowrap|''R''(''K'') {{=}} (4''π'')<sup>''−d''/2</sup>}} when ''K'' is a normal kernel
* <math>\int \mathbf{x} \mathbf{x}^T K(\mathbf{x}) \, d\mathbf{x} = m_2(K) \mathbf{I}_d</math>,
:with '''I'''<sub>''d''</sub> being the ''d × d'' [[identity matrix]], and ''m''<sub>2</sub>(''K'') = 1 for the normal kernel
* D<sup>2</sup>''ƒ'' is the ''d × d'' Hessian matrix of second order partial derivatives of ''ƒ''
* <math>\boldsymbol{\Psi}_4 = \int (\operatorname{vec} \, \operatorname{D}^2 f(\mathbf{x})) (\operatorname{vec}^T \operatorname{D}^2 f(\mathbf{x})) \, d\mathbf{x}</math> is a ''d''<sup>2</sup> × ''d''<sup>2</sup> matrix of integrated fourth order partial derivatives of ''ƒ''
* vec is the vector operator which stacks the columns of a matrix into a single vector e.g. <math>\operatorname{vec}\begin{bmatrix}a & c \\ b & d\end{bmatrix} = \begin{bmatrix}a & b & c & d\end{bmatrix}^T.</math>
 
The quality of the AMISE approximation to the MISE<ref name="WJ1995"/>{{rp|97}} is given by
 
: <math>\operatorname{MISE} (\mathbf{H}) = \operatorname{AMISE} (\mathbf{H}) + o(n^{-1} |\mathbf{H}|^{-1/2} + \operatorname{tr} \, \mathbf{H}^2)</math>
 
where ''o'' indicates the usual [[big O notation|small o notation]]. Heuristically this statement implies that the AMISE is a 'good' approximation of the MISE as the sample size ''n'' → ∞.
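
To make the formula concrete, the AMISE translates directly into code. The following R sketch (the function and argument names are hypothetical, and '''Ψ'''<sub>4</sub> is assumed to be supplied, e.g. computed from a reference density) evaluates the AMISE for a normal kernel, for which ''m''<sub>2</sub>(''K'') = 1 and ''R''(''K'') = (4''π'')<sup>−''d''/2</sup>:

<source lang="rsplus">
# Sketch: evaluate AMISE(H) for a normal kernel; Psi4 is the d^2 x d^2
# matrix of integrated fourth-order density derivatives (assumed given).
amise <- function(H, n, Psi4) {
  d  <- nrow(H)
  RK <- (4 * pi)^(-d / 2)       # R(K) for the normal kernel
  vH <- as.vector(H)            # vec H stacks the columns of H
  n^(-1) * det(H)^(-1 / 2) * RK +
    (1 / 4) * drop(t(vH) %*% Psi4 %*% vH)   # m_2(K)^2 = 1
}
</source>
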
It can be shown that any reasonable bandwidth selector '''H''' has '''H''' = ''O''(''n''<sup>−2/(''d''+4)</sup>), where the [[big O notation]] is applied elementwise. Substituting this into the MISE formula yields that the optimal MISE is ''O''(''n''<sup>−4/(''d''+4)</sup>).<ref name="WJ1995"/>{{rp|99–100}} Thus as ''n'' → ∞, the MISE → 0, i.e. the kernel density estimate [[convergence in mean|converges in mean square]] and thus also in probability to the true density ''f''. These modes of convergence confirm the statement in the motivation section that kernel methods lead to reasonable density estimators. An ideal optimal bandwidth selector is
 
: <math>\mathbf{H}_{\operatorname{AMISE}} = \operatorname{argmin}_{\mathbf{H} \in F} \, \operatorname{AMISE} (\mathbf{H}).</math>
 
Since this ideal selector contains the unknown density function ''ƒ'', it cannot be used directly. The many different varieties of data-based bandwidth selectors arise from the different estimators of the AMISE. We concentrate on two classes of selectors which have been shown to be the most widely applicable in practice: smoothed cross validation and plug-in selectors.
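
Although this ideal selector is not computable in practice, its definition as a minimiser can be illustrated numerically when '''Ψ'''<sub>4</sub> is known, e.g. for a reference density. A minimal sketch, building on the hypothetical <code>amise</code> function above and parametrising '''H''' by its Cholesky factor so that the search stays within the class ''F'' of symmetric, positive definite matrices:

<source lang="rsplus">
# Sketch: minimise amise() over symmetric positive definite H = L L^T.
h_amise <- function(n, Psi4, d = 2) {
  lower <- lower.tri(diag(d), diag = TRUE)
  obj <- function(theta) {
    L <- matrix(0, d, d)
    L[lower] <- theta
    amise(tcrossprod(L), n, Psi4)              # H = L %*% t(L)
  }
  theta0 <- diag(d)[lower] * n^(-1 / (d + 4))  # H then scales as n^(-2/(d+4))
  L <- matrix(0, d, d)
  L[lower] <- optim(theta0, obj)$par
  tcrossprod(L)                                # approximate H_AMISE
}
</source>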
 
===Plug-in===
The plug-in (PI) estimate of the AMISE is formed by replacing '''Ψ'''<sub>4</sub> by its estimator <math>\hat{\boldsymbol{\Psi}}_4</math>
 
: <math>\operatorname{PI}(\mathbf{H}) = n^{-1} |\mathbf{H}|^{-1/2} R(K) + \tfrac{1}{4} m_2(K)^2
(\operatorname{vec}^T \mathbf{H}) \hat{\boldsymbol{\Psi}}_4 (\mathbf{G}) (\operatorname{vec} \, \mathbf{H})</math>
 
where <math>\hat{\boldsymbol{\Psi}}_4 (\mathbf{G}) = n^{-2} \sum_{i=1}^n
\sum_{j=1}^n [(\operatorname{vec} \, \operatorname{D}^2) (\operatorname{vec}^T \operatorname{D}^2)] K_\mathbf{G} (\mathbf{X}_i - \mathbf{X}_j)</math>. Thus <math>\hat{\mathbf{H}}_{\operatorname{PI}} = \operatorname{argmin}_{\mathbf{H} \in F} \, \operatorname{PI} (\mathbf{H})</math> is the plug-in selector.<ref>{{Cite journal| author1=Wand, M.P. | author2=Jones, M.C. | title=Multivariate plug-in bandwidth selection | journal=Computational Statistics | year=1994 | volume=9 | pages=97–177}}</ref><ref name="DH2005" /> These references also contain algorithms for the optimal estimation of the pilot bandwidth matrix '''G''' and establish that <math>\hat{\mathbf{H}}_{\operatorname{PI}}</math> [[convergence in probability|converges in probability]] to '''H'''<sub>AMISE</sub>.
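
In practice, the plug-in selector is implemented, for example, as <code>Hpi()</code> in the R package <code>ks</code>; a minimal usage sketch on simulated data:

<source lang="rsplus">
library(ks)                        # kernel smoothing package
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # n = 100 bivariate standard normal points
Hpi(x = x)                         # 2 x 2 plug-in bandwidth matrix
</source>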
 
===Smoothed cross validation===
Smoothed cross validation (SCV) is a subset of a larger class of [[cross-validation (statistics)|cross validation]] techniques. The SCV estimator differs from the plug-in estimator in the second term
 
: <math>\operatorname{SCV}(\mathbf{H}) = n^{-1} |\mathbf{H}|^{-1/2} R(K) +
n^{-2} \sum_{i=1}^n \sum_{j=1}^n (K_{2\mathbf{H} +2\mathbf{G}} - 2K_{\mathbf{H} +2\mathbf{G}}
+ K_{2\mathbf{G}}) (\mathbf{X}_i - \mathbf{X}_j)</math>
 
Thus <math>\hat{\mathbf{H}}_{\operatorname{SCV}} = \operatorname{argmin}_{\mathbf{H} \in F} \, \operatorname{SCV} (\mathbf{H})</math> is the SCV selector.<ref name="DH2005">{{Cite journal| doi=10.1111/j.1467-9469.2005.00445.x | author1=Duong, T. | author2=Hazelton, M.L. | title=Cross validation bandwidth matrices for multivariate kernel density estimation | journal=Scandinavian Journal of Statistics | year=2005 | volume=32 | issue=3 | pages=485–506}}</ref><ref>{{Cite journal| doi=10.1007/BF01205233 | author1=Hall, P. | author2=Marron, J. | author3=Park, B. | title=Smoothed cross-validation | journal=Probability Theory and Related Fields | year=1992 | volume=92 | pages=1–20}}</ref>
These references also contain algorithms for the optimal estimation of the pilot bandwidth matrix '''G''' and establish that <math>\hat{\mathbf{H}}_{\operatorname{SCV}}</math> converges in probability to '''H'''<sub>AMISE</sub>.
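
The SCV selector is likewise available as <code>Hscv()</code> in the R package <code>ks</code>, with the same calling convention:

<source lang="rsplus">
library(ks)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # same simulated sample as above
Hscv(x = x)                        # 2 x 2 SCV bandwidth matrix
</source>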
 
=== Rule of thumb ===
In the optimal bandwidth selection section, we introduced the MISE. Its construction relies on the [[expected value]] and the [[variance]] of the density estimator<ref name="WJ1995" />{{rp|97}}
 
:<math>\operatorname{E} \hat{f}(\mathbf{x};\mathbf{H}) = K_\mathbf{H} * f (\mathbf{x}) = f(\mathbf{x}) + \frac{1}{2} m_2(K) \operatorname{tr} (\mathbf{H} \operatorname{D}^2 f(\mathbf{x})) + O(\operatorname{tr} \, \mathbf{H}^2)</math>
 
where * is the [[convolution]] operator between two functions, and
 
:<math>\operatorname{Var} \hat{f}(\mathbf{x};\mathbf{H}) = n^{-1} |\mathbf{H}|^{-1/2} R(K) + o(n^{-1} |\mathbf{H}|^{-1/2}).</math>
 
For these two expressions to be well-defined, we require that all elements of '''H''' tend to 0 and that ''n''<sup>−1</sup> |'''H'''|<sup>−1/2</sup> tends to 0 as ''n'' tends to infinity. Assuming these two conditions, we see that the expected value tends to the true density ''f'', i.e. the kernel density estimator is asymptotically [[Bias of an estimator|unbiased]], and that the variance tends to zero. Using the standard mean squared error decomposition
 
:<math>\operatorname{MSE} \, \hat{f}(\mathbf{x};\mathbf{H}) = \operatorname{Var} \hat{f}(\mathbf{x};\mathbf{H}) + [\operatorname{E} \hat{f}(\mathbf{x};\mathbf{H}) - f(\mathbf{x})]^2</math>
 
we have that the MSE tends to 0, implying that the kernel density estimator is (mean square) consistent and hence converges in probability to the true density ''f''. The rate of convergence of the MSE to 0 is necessarily the same as the MISE rate noted previously, ''O''(''n''<sup>−4/(''d''+4)</sup>), hence the convergence rate of the density estimator to ''f'' is ''O<sub>p</sub>''(''n''<sup>−2/(''d''+4)</sup>) where ''O<sub>p</sub>'' denotes [[Big O in probability notation|order in probability]]. This establishes pointwise convergence. The functional convergence is established similarly by considering the behaviour of the MISE, and noting that under sufficient regularity, integration does not affect the convergence rates.
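
Substituting the expectation and variance expansions above into this decomposition gives, to leading order,

:<math>\operatorname{MSE} \, \hat{f}(\mathbf{x};\mathbf{H}) = n^{-1} |\mathbf{H}|^{-1/2} R(K) + \tfrac{1}{4} m_2(K)^2 [\operatorname{tr} (\mathbf{H} \operatorname{D}^2 f(\mathbf{x}))]^2 + o(n^{-1} |\mathbf{H}|^{-1/2} + \operatorname{tr} \, \mathbf{H}^2),</math>

and both leading terms are ''O''(''n''<sup>−4/(''d''+4)</sup>) when '''H''' = ''O''(''n''<sup>−2/(''d''+4)</sup>).
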
For the data-based bandwidth selectors considered, the target is the AMISE bandwidth matrix. We say that a data-based selector converges to the AMISE selector at relative rate ''O<sub>p</sub>''(''n''<sup>−''α''</sup>), ''α'' > 0 if
 
:<math>\operatorname{vec} (\hat{\mathbf{H}} - \mathbf{H}_{\operatorname{AMISE}}) = O(n^{-2\alpha}) \operatorname{vec} \mathbf{H}_{\operatorname{AMISE}}.</math>
 
It has been established that the plug-in and smoothed cross validation selectors (given a single pilot bandwidth '''G''') both converge at a relative rate of ''O<sub>p</sub>''(''n''<sup>−2/(''d''+6)</sup>),<ref name="DH2005" /><ref>{{Cite journal| doi=10.1016/j.jmva.2004.04.004 | author1=Duong, T. | author2=Hazelton, M.L. | title=Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation | journal=Journal of Multivariate Analysis | year=2005 | volume=93 | issue=2 | pages=417–433}}</ref> i.e., both these data-based selectors are consistent estimators.
waiting time until the next eruption (minutes) of the [[Old Faithful Geyser]] in Yellowstone National Park, USA.
 
The code fragment below computes the kernel density estimate with the plug-in bandwidth matrix <math>\hat{\mathbf{H}}_{\operatorname{PI}} = \begin{bmatrix}0.052 & 0.510 \\ 0.510 & 8.882\end{bmatrix}.</math> Again, the coloured contours correspond to the smallest region which contains the respective probability mass: red = 25%, orange + red = 50%, yellow + orange + red = 75%. To compute the SCV selector, <code>Hpi</code> is replaced with <code>Hscv</code>. The SCV estimate is not displayed here since it is very similar to the plug-in estimate for this example.
 
<source lang="rsplus" style="overflow:auto;">
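# A minimal sketch of such a fragment, assuming the R package ks,
# which provides the Hpi(), Hscv() and kde() functions named above.
library(ks)
data(faithful)                          # Old Faithful eruption data
H <- Hpi(x = faithful)                  # plug-in bandwidth matrix selector
fhat <- kde(x = faithful, H = H)        # kernel density estimate
plot(fhat, display = "filled.contour",
     cont = c(25, 50, 75))              # 25%, 50%, 75% probability contours
points(faithful, cex = 0.5, pch = 16)   # overlay the data points
</source>
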
There are alternative optimality criteria, which attempt to cover cases where MISE is not an appropriate measure.<ref name="simonoff1996" />{{rp|34–37,78}} The equivalent ''L<sub>1</sub>'' measure, Mean Integrated Absolute Error, is
 
: <math>\operatorname{MIAE} (\mathbf{H}) = \operatorname{E}\, \int |\hat{f}_\mathbf{H} (\mathbf{x}) - f(\mathbf{x})| \, d\mathbf{x}.</math>
 
Its mathematical analysis is considerably more difficult than that of the MISE. In practice, the gain appears not to be significant.<ref>{{cite journal | author1=Hall, P. | author2=Wand, M.P. | title=Minimizing L<sub>1</sub> distance in nonparametric density estimation | journal = Journal of Multivariate Analysis | year=1988 | volume=26 | pages=59–88 | doi=10.1016/0047-259X(88)90073-5}}</ref> The ''L<sub>∞</sub>'' norm is the Mean Uniform Absolute Error
 
: <math>\operatorname{MUAE} (\mathbf{H}) = \operatorname{E}\, \sup_{\mathbf{x}} |\hat{f}_\mathbf{H} (\mathbf{x}) - f(\mathbf{x})|,</math>
which has been investigated only briefly.<ref>{{cite journal | author1=Cao, R. | author2=Cuevas, A. | author3=Manteiga, W.G.| title=A comparative study of several smoothing methods in density estimation | journal = Computational Statistics and Data Analysis | year=1994 | volume=17 | issue=2 | pages=153–176 | doi=10.1016/0167-9473(92)00066-Z}}</ref> Likelihood error criteria include those based on the Mean [[Kullback-Leibler distance]]
 
: <math>\operatorname{MKL} (\mathbf{H}) = \int f(\mathbf{x}) \, \log [f(\mathbf{x})] \, d\mathbf{x} - \operatorname{E} \int f(\mathbf{x}) \, \log [\hat{f}(\mathbf{x};\mathbf{H})] \, d\mathbf{x}</math>
 
and the Mean [[Hellinger distance]]
 
: <math>\operatorname{MH} (\mathbf{H}) = \operatorname{E} \int (\hat{f}_\mathbf{H} (\mathbf{x})^{1/2} - f(\mathbf{x})^{1/2})^2 \, d\mathbf{x} .</math>
 
The MKL can be estimated using a cross-validation method, although KL cross-validation selectors can be sub-optimal even if they remain [[Consistent estimator|consistent]] for bounded density functions.<ref>{{cite journal | author=Hall, P. | title=On Kullback-Leibler loss and density estimation | journal=Annals of Statistics | volume=15 | issue=4 | year=1989 | pages=589–605 | doi=10.1214/aos/1176350606}}</ref> MH selectors have been briefly examined in the literature.<ref>{{cite journal | author1=Ahmad, I.A. | author2=Mugdadi, A.R. | title=Weighted Hellinger distance as an error criterion for bandwidth selection in kernel estimation | journal=Journal of Nonparametric Statistics | volume=18 | issue=2 | year=2006 | pages=215–226 | doi=10.1080/10485250600712008}}</ref>
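
As a rough illustration, single-sample analogues of these discrepancies can be approximated numerically on a grid. A minimal R sketch for a univariate standard normal target, with a fixed, hypothetical bandwidth of 0.3:

<source lang="rsplus">
set.seed(1)
x  <- rnorm(500)                     # sample from the standard normal
xs <- seq(-4, 4, length.out = 1001)  # evaluation grid
dx <- xs[2] - xs[1]
fhat <- sapply(xs, function(u) mean(dnorm(u, mean = x, sd = 0.3)))  # KDE, h = 0.3
f <- dnorm(xs)                       # true density
sum(abs(fhat - f)) * dx              # integrated absolute error (L1)
max(abs(fhat - f))                   # uniform absolute error (L-infinity)
sum((sqrt(fhat) - sqrt(f))^2) * dx   # squared Hellinger distance
sum(f * log(f / fhat)) * dx          # Kullback-Leibler distance
</source>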