Multivariate kernel density estimation

== Motivation ==
To motivate the definition of multivariate kernel density estimators, we take an illustrative synthetic [[bivariate]] data set of 50 points and use it to illustrate the construction of histograms. This requires the choice of an anchor point (the lower left corner of the histogram grid). For the histogram on the left, we choose (-1.5, -1.5); for the one on the right, we shift the anchor point by 0.125 in both directions to (-1.625, -1.625). Both histograms have a binwidth of 0.5, so any differences are due to the change in the anchor point only. The colour coding indicates the number of data points which fall into a bin: 0 = white, 1 = pale yellow, 2 = bright yellow, 3 = orange, 4 = red. The left histogram appears to indicate that the upper half has a higher density than the lower half, whereas the reverse is the case for the right-hand histogram, confirming that histograms are highly sensitive to the placement of the anchor point.
 
[[Image:Synthetic data 2D histograms.png|center|720px]]
 
One possible solution to this anchor point placement problem is to remove the histogram binning grid completely. In the left figure below, a kernel (represented by the dashed grey lines) is centred at each of the 50 data points above. The result of summing these kernels is given in the right figure, which is a kernel density estimate. This indicates that there is a single central region of highest density.
 
[[Image:Synthetic data 2D KDE.png|center|720px]]
 
== Definition ==
The previous figure is a graphical representation of a kernel density estimate, which we now define in an exact manner. Let <math>\bold{X}_1, \bold{X}_2, \dots, \bold{X}_n</math> be a <em>d</em>-variate random sample drawn from a common density function <em>f</em>. The kernel density estimate is defined to be
 
<math>\hat{f}_\bold{H}(\bold{x})= n^{-1} \sum_{i=1}^n K_\bold{H} (\bold{x} - \bold{X}_i)</math>
where
<ul>
<li> <math>\bold{x} = (x_1, x_2, \dots, x_d)^T</math> and <math>\bold{X}_i = (X_{i1}, X_{i2}, \dots, X_{id})^T, \, i = 1, 2, \dots, n</math> are <em>d</em>-vectors;
<li> <strong>H</strong> is the bandwidth (or smoothing) <em>d</em> × <em>d</em> matrix, which is symmetric and positive definite;
<li> <em>K</em> is the kernel function, a symmetric multivariate density, rescaled as <math>K_\bold{H}(\bold{x}) = |\bold{H}|^{-1/2} K(\bold{H}^{-1/2}\bold{x})</math>.
</ul>
 
The choice of the kernel function <em>K</em> is not crucial to the accuracy of kernel density estimators, so we use the standard [[multivariate normal distribution|multivariate normal]] or Gaussian density function as our kernel <em>K</em> throughout: <math>K (\bold{x}) = (2\pi)^{-d/2} \exp(-\tfrac{1}{2} \, \bold{x}^T \bold{x})</math>. In contrast, the choice of the bandwidth matrix <strong>H</strong> is the single most important factor affecting accuracy, since it controls the amount and orientation of the smoothing induced.<ref name="WJ1995">{{cite book | author1=Wand, M.P | author2=Jones, M.C. | title=Kernel Smoothing | publisher=Chapman & Hall | ___location=London | date=1995 | isbn=0412552701}}</ref> (pp. 36–39). (The bandwidth of a kernel density estimate is analogous to the binwidth of a histogram.)
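
A minimal sketch of the estimator defined above, for illustration only (a dedicated package, such as the one in the implementation section below, should be used in practice). It evaluates the estimate at a single point <code>x</code> with the Gaussian kernel; the hypothetical sample matrix <code>X</code> (<em>n</em> × <em>d</em>) and bandwidth matrix <code>H</code> are assumed given.

<pre>
# Minimal sketch: evaluates the kernel density estimate at one point x,
# implementing n^(-1) * sum_i K_H(x - X_i), where the scaled Gaussian kernel
# K_H(u) = |H|^(-1/2) (2*pi)^(-d/2) exp(-u' H^(-1) u / 2) is the normal
# density with covariance matrix H.
kde_at_point <- function(x, X, H) {
  n <- nrow(X); d <- ncol(X)
  Hinv <- solve(H)
  quad <- apply(X, 1, function(Xi) {
    u <- x - Xi
    drop(t(u) %*% Hinv %*% u)   # quadratic form u' H^(-1) u
  })
  det(H)^(-1/2) * sum((2 * pi)^(-d/2) * exp(-quad / 2)) / n
}

# Example: 50 synthetic bivariate points and a diagonal bandwidth matrix
set.seed(1)
X <- matrix(rnorm(100), ncol = 2)
H <- diag(c(0.25, 0.25))
kde_at_point(c(0, 0), X, H)
</pre>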
== Optimal bandwidth matrix selection ==
The most commonly used optimality criterion for selecting a bandwidth matrix is the MISE or Mean Integrated Squared Error
<math>\operatorname{MISE} (\bold{H}) = E \int [\hat{f}_\bold{H} (\bold{x}) - f(\bold{x})]^2 \, d\bold{x} .</math>
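
By expanding the square and exchanging expectation and integration, the MISE splits into an integrated squared bias term and an integrated variance term (a standard identity, stated here for orientation):

<math>\operatorname{MISE} (\bold{H}) = \int [\operatorname{E} \hat{f}_\bold{H} (\bold{x}) - f(\bold{x})]^2 \, d\bold{x} + \int \operatorname{Var} \hat{f}_\bold{H} (\bold{x}) \, d\bold{x}.</math>

These two terms give rise, respectively, to the second (squared bias) and first (variance) terms of the asymptotic approximation below.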
 
The MISE does not in general possess a closed-form expression, so it is usual to use its asymptotic approximation (AMISE) as a proxy
 
<math>\operatorname{AMISE} (\bold{H}) = n^{-1} |\bold{H}|^{-1/2} R(K) + \tfrac{1}{4} m_2(K)^2
(\operatorname{vec}^T \bold{H}) \bold{\Psi}_4 (\operatorname{vec} \, \bold{H})</math>

where <math>R(K) = \int K(\bold{x})^2 \, d\bold{x}</math>; <math>m_2(K)</math> is defined by <math>\int \bold{x} \bold{x}^T K(\bold{x}) \, d\bold{x} = m_2(K) \bold{I}_d</math>; and <math>\bold{\Psi}_4 = \int (\operatorname{vec} \, \operatorname{D}^2 f(\bold{x})) (\operatorname{vec}^T \operatorname{D}^2 f(\bold{x})) \, d\bold{x}</math> is a <em>d</em><sup>2</sup> × <em>d</em><sup>2</sup> matrix of integrated fourth order partial derivatives of <em>f</em>, with <math>\operatorname{D}^2 f</math> the Hessian of <em>f</em>.
The AMISE approximates the MISE in the sense that their difference is of smaller order, in the usual small-<em>o</em> and [[big O notation|big-<em>O</em>]] sense, than the AMISE itself. Heuristically, this implies that the AMISE is a 'good' approximation of the MISE as the sample size <em>n</em> → ∞.
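
For the normal kernel used throughout this article, these constants are explicit: <math>R(K) = (4\pi)^{-d/2}</math> and <math>m_2(K) = 1</math>, so the variance term of the AMISE reduces to <math>n^{-1} |\bold{H}|^{-1/2} (4\pi)^{-d/2}</math>.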
 
An ideal optimal bandwidth selector is
<math>\bold{H}_{\operatorname{AMISE}} = \operatorname{argmin}_{\bold{H} \in F} \, \operatorname{AMISE} (\bold{H})</math>
where <em>F</em> is the space of all symmetric, positive definite matrices.
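
Since the AMISE, and hence <math>\bold{H}_{\operatorname{AMISE}}</math>, depends on the unknown density <em>f</em>, practical selectors minimise a data-based estimate of this criterion numerically over <em>F</em>. The following is a schematic sketch in R of one way such a minimisation can be carried out, assuming some criterion function <code>crit</code> (e.g. the plug-in criterion below) is supplied; the Cholesky parametrisation <strong>H</strong> = <strong>LL</strong><sup>T</sup> keeps every candidate matrix inside <em>F</em>.

<pre>
# Schematic sketch: unconstrained minimisation of a bandwidth criterion
# crit(H) over the class F of symmetric positive definite d x d matrices,
# by parametrising H = L L' through its Cholesky factor L.
minimise_over_F <- function(crit, d = 2) {
  theta_to_H <- function(theta) {
    L <- matrix(0, d, d)
    L[lower.tri(L, diag = TRUE)] <- theta
    diag(L) <- exp(diag(L))   # positive diagonal guarantees positive definiteness
    L %*% t(L)
  }
  res <- optim(rep(0, d * (d + 1) / 2),
               function(theta) crit(theta_to_H(theta)))
  theta_to_H(res$par)
}
</pre>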
 
=== Plug-in ===
The plug-in (PI) estimate of the AMISE is formed by replacing <math>\bold{\Psi}_4</math> with its estimator <math>\hat{\bold{\Psi}}_4</math>
 
<math>\operatorname{PI}(\bold{H}) = n^{-1} |\bold{H}|^{-1/2} R(K) + \tfrac{1}{4} m_2(K)^2
(\operatorname{vec}^T \bold{H}) \hat{\bold{\Psi}}_4 (\bold{G}) (\operatorname{vec} \, \bold{H})</math>
 
where <math>\hat{\bold{\Psi}}_4 (\bold{G}) = n^{-2} \sum_{i=1}^n
\sum_{j=1}^n [(\operatorname{vec} \, \operatorname{D}^2) (\operatorname{vec}^T \operatorname{D}^2)] K_\bold{G} (\bold{X}_i - \bold{X}_j)</math> and <math>\operatorname{D}^2</math> denotes the Hessian operator. Thus <math>\hat{\bold{H}}_{\operatorname{PI}} = \operatorname{argmin}_{\bold{H} \in F} \, \operatorname{PI} (\bold{H})</math> is the plug-in selector.<ref>{{cite journal | author1=Wand, M.P. | author2=Jones, M.C. | title=Multivariate plug-in bandwidth selection | journal=Computational Statistics | year=1994 | volume=9 | pages=97–177}}</ref><ref>{{cite journal | doi=10.1080/10485250306039 | author1=Duong, T. | author2=Hazelton, M.L. | title=Plug-in bandwidth matrices for bivariate kernel density estimation | journal=Journal of Nonparametric Statistics | year=2003 | volume=15 | pages=17–30}}</ref> These references also contain algorithms for the optimal estimation of the pilot bandwidth matrix <strong>G</strong>.
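
As a concrete (hypothetical) illustration, the PI criterion for the normal kernel takes only a few lines of R, assuming an estimate <code>Psi4hat</code> of <math>\bold{\Psi}_4</math> has already been formed with some pilot bandwidth <strong>G</strong>; minimising it, e.g. with the Cholesky parametrisation sketched earlier, yields <math>\hat{\bold{H}}_{\operatorname{PI}}</math>.

<pre>
# Sketch of the PI criterion above for the normal kernel, for which
# R(K) = (4*pi)^(-d/2) and m2(K) = 1; Psi4hat is a hypothetical
# precomputed d^2 x d^2 estimate of Psi_4 (4 x 4 for bivariate data).
pi_criterion <- function(H, Psi4hat, n) {
  d <- nrow(H)
  vecH <- as.vector(H)   # vec H stacks the columns of H
  n^(-1) * det(H)^(-1/2) * (4 * pi)^(-d/2) +
    (1/4) * drop(t(vecH) %*% Psi4hat %*% vecH)
}
</pre>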
 
 
== Computer implementation ==
The [http://cran.r-project.org/web/packages/ks/index.html ks package]<ref>{{cite journal | author1=Duong, T. | title=ks: Kernel density estimation and kernel discriminant analysis in R | journal=Journal of Statistical Software | year=2007 | volume=21(7) | url=http://www.jstatsoft.org/v21/i07}}</ref> in [[R programming language|R]] implements the plug-in and smoothed cross validation selectors (amongst others). The example dataset contains 272 records with two measurements each: the duration time of an eruption (minutes) and the waiting time until the next eruption (minutes) of the [[Old Faithful Geyser]] in Yellowstone National Park, USA, and is included in the base distribution of R.

The code fragment below computes the kernel density estimate with the plug-in bandwidth matrix <math>\hat{\bold{H}}_{\operatorname{PI}} = \begin{bmatrix}0.052 & 0.510 \\ 0.510 & 8.882\end{bmatrix}.</math> The coloured contours correspond to the smallest regions which contain the given probability mass: red = 25%, orange + red = 50%, yellow + orange + red = 75%. To compute the SCV selector, <code>Hpi</code> is replaced with <code>Hscv</code>; this estimate is not displayed here since it is mostly similar to the plug-in estimate for this example.
<pre>