{{Short description|Concept in statistics}}
[[Kernel density estimation]] is a [[nonparametric]] technique for [[density estimation]], i.e., the estimation of [[probability density function]]s, which is one of the fundamental problems in [[statistics]]. It can be viewed as a generalisation of [[histogram]] density estimation with improved statistical properties. Apart from histograms, other types of density estimators include [[parametric statistics|parametric]], [[spline interpolation|spline]], [[wavelet]] and [[Fourier series]] estimators. Kernel density estimators were first introduced in the scientific literature for [[univariate]] data in the 1950s and 1960s<ref>{{Cite journal| doi=10.1214/aoms/1177728190 | last=Rosenblatt | first=M. | title=Remarks on some nonparametric estimates of a density function | journal=Annals of Mathematical Statistics | year=1956 | volume=27 | issue=3 | pages=832–837 | doi-access=free }}</ref><ref>{{Cite journal| doi=10.1214/aoms/1177704472 | last=Parzen | first=E. | title=On estimation of a probability density function and mode | journal=Annals of Mathematical Statistics | year=1962 | volume=33 | issue=3 | pages=1065–1076 | doi-access=free }}</ref> and have subsequently been widely adopted. It was soon recognised that analogous estimators for multivariate data would be an important addition to [[multivariate statistics]]. Based on research carried out in the 1990s and 2000s, '''multivariate kernel density estimation''' has reached a level of maturity comparable to that of its univariate counterparts.<ref name="WJ1995">{{Cite book| author1=Wand, M.P. | author2=Jones, M.C. | title=Kernel Smoothing | publisher=Chapman & Hall/CRC | ___location=London | year=1995 | isbn=9780412552700}}</ref><ref name="simonoff1996">{{Cite book| author=Simonoff, J.S. | title=Smoothing Methods in Statistics | publisher=Springer | year=1996 | isbn=9780387947167}}</ref><ref name="chacon2018">{{Cite book| author1=Chacón, J.E. | author2=Duong, T. | title=Multivariate Kernel Smoothing and Its Applications | publisher=Chapman & Hall/CRC | year=2018 | isbn=9781498763011}}</ref>
where ''N'' is the number of data points, ''d'' is the number of dimensions (variables), and <math>I_{\vec{A}}(\vec{t})</math> is a filter that is equal to 1 for 'accepted frequencies' and 0 otherwise. There are various ways to define this filter function, and a simple one that works for univariate or multivariate samples is called the 'lowest contiguous hypervolume filter'; here <math>I_{\vec{A}}(\vec{t})</math> is chosen such that the only accepted frequencies are a contiguous subset of frequencies surrounding the origin for which <math>|\hat{\varphi}(\vec{t})|^2 \ge 4(N-1)N^{-2}</math> (see <ref name=":22"/> for a discussion of this and other filter functions).
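As a concrete illustration, the empirical characteristic function and the one-dimensional form of this contiguous filter can be computed by direct summation as follows. This is a schematic sketch, not code from the cited references; the sample data, the frequency grid limits and the grid resolution are arbitrary choices made for the example.

<syntaxhighlight lang="matlab" style="overflow:auto;">
% Illustrative sketch (univariate case): compute the empirical
% characteristic function (ECF) by direct summation and apply the
% contiguous low-frequency filter described above.
N = 1000;                        % number of data points
x = randn(N, 1);                 % sample data (standard normal, for illustration)
t = linspace(-20, 20, 401)';     % frequency grid (illustrative choice)

% ECF: phi_hat(t) = (1/N) * sum_j exp(1i * t * x_j) -- cost O(N * numel(t))
phi = mean(exp(1i * t * x.'), 2);

% Threshold from the text: accept frequencies with |phi_hat|^2 >= 4(N-1)/N^2
accept = abs(phi).^2 >= 4*(N - 1)/N^2;

% Keep only the contiguous run of accepted frequencies around the origin
% (the one-dimensional form of the 'lowest contiguous hypervolume filter')
origin = find(t >= 0, 1);
lo = origin; while lo > 1 && accept(lo - 1), lo = lo - 1; end
hi = origin; while hi < numel(t) && accept(hi + 1), hi = hi + 1; end
I = false(size(t)); I(lo:hi) = true;

phiFiltered = phi .* I;          % filtered ECF used by the estimator
</syntaxhighlight>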
Note that direct calculation of the ''empirical characteristic function'' (ECF) is slow, since it essentially involves a direct Fourier transform of the data samples. However, it has been found that the ECF can be approximated accurately using a [[Non-uniform discrete Fourier transform|non-uniform fast Fourier transform]] (nuFFT) method,<ref name=":1" /><ref name=":22"/> which increases the calculation speed by several orders of magnitude (depending on the dimensionality of the problem). The combination of this objective KDE method and the nuFFT-based ECF approximation has been referred to as ''fastKDE''.
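As a sketch of this acceleration, and under the assumption that MATLAB's built-in <code>nufft</code> function (available since R2020a) is an acceptable stand-in for the nuFFT implementation used by the cited references, the direct summation in the previous sketch can be replaced by a single fast transform. Note the sign and scaling adjustments needed to match <code>nufft</code>'s <math>e^{-2\pi i f t}</math> convention.

<syntaxhighlight lang="matlab" style="overflow:auto;">
% Same ECF as in the previous sketch, now via a non-uniform FFT.
% MATLAB's nufft computes Y(k) = sum_j X(j) * exp(-2i*pi*f(k)*t(j)),
% so evaluating at query frequencies f = -t/(2*pi) with unit weights
% at the sample points x gives sum_j exp(1i*t(k)*x_j) = N * phi_hat(t(k)).
phiFast = nufft(ones(N, 1), x, -t/(2*pi)) / N;
</syntaxhighlight>

Higher-dimensional samples proceed analogously with a multidimensional nuFFT.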
[[File:FastKDE_example.jpg|alt=A demonstration of fastKDE relative to a sample PDF. (a) True PDF, (b) a good representation with fastKDE, and (c) a slightly blurry representation.|none|thumb|664x664px|A non-trivial mixture of normal distributions: (a) the underlying PDF, (b) a fastKDE estimate on 1,000,000 samples, and (c) a fastKDE estimate on 10,000 samples.]]