Kernel embedding of distributions

{{Short description|Class of nonparametric methods}}
{{cleanup|reason=This nonsense of calling a distribution ''P''(''X''), with a capital ''X'', when capital ''X'' is also the name of the random variable, and other like things, need to get cleaned up.|date=March 2020}}
 
In [[machine learning]], the '''kernel embedding of distributions''' (also called the '''kernel mean''' or '''mean map''') comprises a class of [[nonparametric]] methods in which a [[probability distribution]] is represented as an element of a [[reproducing kernel Hilbert space]] (RKHS).<ref name = "Smola2007">A. Smola, A. Gretton, L. Song, B. Schölkopf. (2007). [http://eprints.pascal-network.org/archive/00003987/01/SmoGreSonSch07.pdf A Hilbert Space Embedding for Distributions] {{Webarchive|url=https://web.archive.org/web/20131215111545/http://eprints.pascal-network.org/archive/00003987/01/SmoGreSonSch07.pdf |date=2013-12-15 }}. ''Algorithmic Learning Theory: 18th International Conference''. Springer: 13–31.</ref> A generalization of the individual data-point feature mapping done in classical [[kernel methods]], the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as [[inner product]]s, distances, [[projection (linear algebra)|projections]], [[linear transformation]]s, and [[spectral theory|spectral analysis]].<ref name = "Song2013">L. Song, K. Fukumizu, F. Dinuzzo, A. Gretton (2013). [http://www.gatsby.ucl.ac.uk/~gretton/papers/SonFukGre13.pdf Kernel Embeddings of Conditional Distributions: A unified kernel framework for nonparametric inference in graphical models]. ''IEEE Signal Processing Magazine'' '''30''': 98–111.</ref> This [[machine learning|learning]] framework is very general and can be applied to distributions over any space <math>\Omega </math> on which a sensible [[kernel function]] (measuring similarity between elements of <math>\Omega </math>) may be defined. For example, various kernels have been proposed for learning from data which are: [[Vector (mathematics and physics)|vectors]] in <math>\mathbb{R}^d</math>, discrete classes/categories, [[string (computer science)|string]]s, [[Graph (discrete mathematics)|graph]]s/[[network theory|networks]], images, [[time series]], [[manifold]]s, [[dynamical systems]], and other structured objects.<ref>J. Shawe-Taylor, N. Cristianini. (2004). ''Kernel Methods for Pattern Analysis''. Cambridge University Press, Cambridge, UK.</ref><ref>T. Hofmann, B. Schölkopf, A. Smola. (2008). [http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1211819561 Kernel Methods in Machine Learning]. ''The Annals of Statistics'' '''36'''(3):1171–1220.</ref> The theory behind kernel embeddings of distributions has been primarily developed by [http://alex.smola.org/ Alex Smola], [http://www.cc.gatech.edu/~lsong/ Le Song] {{Webarchive|url=https://web.archive.org/web/20210412002513/https://www.cc.gatech.edu/~lsong/ |date=2021-04-12 }}, [http://www.gatsby.ucl.ac.uk/~gretton/ Arthur Gretton], and [[Bernhard Schölkopf]]. A review of recent works on kernel embedding of distributions can be found in Muandet et al.<ref>{{Cite journal|last1=Muandet|first1=Krikamol|last2=Fukumizu|first2=Kenji|last3=Sriperumbudur|first3=Bharath|last4=Schölkopf|first4=Bernhard|date=2017-06-28|title=Kernel Mean Embedding of Distributions: A Review and Beyond|journal=Foundations and Trends in Machine Learning|language=English|volume=10|issue=1–2|pages=1–141|doi=10.1561/2200000060|issn=1935-8237|arxiv=1605.09522}}</ref>
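As a concrete illustration, a sample <math>\{x_i\}_{i=1}^n</math> from a distribution <math>P</math> is embedded via the empirical kernel mean <math>\hat{\mu}_P = \tfrac{1}{n}\sum_{i=1}^n k(x_i,\cdot)</math>, and two such embeddings can be compared by their RKHS distance, which reduces entirely to kernel evaluations. The following minimal sketch (in Python with NumPy; the Gaussian RBF kernel, sample sizes, bandwidth, and variable names are illustrative assumptions, not taken from the cited sources) computes the squared RKHS distance between the empirical embeddings of two samples:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def embedding_distance_sq(X, Y, sigma=1.0):
    """Squared RKHS distance ||mu_hat_P - mu_hat_Q||^2 between empirical mean embeddings."""
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from a shifted distribution Q
print(embedding_distance_sq(X, Y))        # strictly positive since P and Q differ
</syntaxhighlight>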
 
The analysis of distributions is fundamental in [[machine learning]] and [[statistics]], and many algorithms in these fields rely on information theoretic approaches such as [[entropy]], [[mutual information]], or [[Kullback–Leibler divergence]]. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data.<ref name = "SongThesis">L. Song. (2008) [https://www.infosys.usyd.edu.au/research/2008_Song_Le_thesis.pdf Learning via Hilbert Space Embedding of Distributions]. PhD Thesis, University of Sydney.</ref> Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. [[Mixture model#Gaussian mixture model|Gaussian mixture models]]), while nonparametric methods like [[kernel density estimation]] (Note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or [[characteristic function (probability theory)|characteristic function]] representation (via the [[Fourier transform]] of the distribution) break down in high-dimensional settings.<ref name = "Song2013" />
 
=== Universal kernels ===
* Let <math>\mathcal{X} \subseteq \mathbb{R}^{b}</math> be a compact metric space and <math>C(\mathcal{X})</math> the space of [[Function space#Functional analysis|continuous functions]] on <math>\mathcal{X}</math>. The reproducing kernel <math>k:\mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R} </math> is called '''universal''' if and only if the RKHS <math>\mathcal{H}</math> of <math>k</math> is [[Dense set|dense]] in <math>C(\mathcal{X})</math>, i.e., for any <math>g \in C(\mathcal{X})</math> and all <math>\varepsilon > 0</math> there exists an <math> f \in \mathcal{H}</math> such that <math> \| f-g\|_{\infty} \leq \varepsilon</math>.<ref>{{cite book |last1=Steinwart |first1=Ingo |last2=Christmann |first2=Andreas |title=Support Vector Machines |publisher=Springer |___location=New York |year=2008 |isbn=978-0-387-77241-7 }}</ref> All universal kernels defined on a compact space are characteristic kernels but the converse is not always true.<ref>{{Cite journal|last1=Sriperumbudur|first1= B. K.|last2=Fukumizu|first2=K.|last3=Lanckriet|first3=G.R.G.|title=Universality, Characteristic Kernels and RKHS Embedding of Measures|year= 2011 |journal=Journal of Machine Learning Research|volume=12|number=70}}</ref>
* Let <math>k</math> be a continuous [[Translational symmetry|translation invariant]] kernel <math>k(x, x') = h(x-x')</math> with <math>x \in \mathbb{R}^{b}</math>. Then [[Bochner's theorem]] guarantees the existence of a unique finite Borel measure <math>\mu</math> (called the [[Spectral theory of ordinary differential equations#Spectral measure|spectral measure]]) on <math>\mathbb{R}^{b}</math> such that
::<math>h(t) = \int_{\mathbb{R}^{b}} e^{-i\langle t, \omega \rangle} d\mu(\omega), \quad \forall t \in \mathbb{R}^{b}.</math>
* If <math>k</math> induces a strictly positive definite kernel matrix for any set of distinct points, then it is a universal kernel.<ref name = "SongThesis" /> For example, the widely used Gaussian RBF kernel
::<math> k(x,x') = \exp\left(-\frac{1}{2\sigma^2} \|x-x'\|^2 \right)</math>
:on compact subsets of <math>\mathbb{R}^{b}</math> is universal (see the numerical sketch below).
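A quick numerical illustration of the strict positive definiteness condition above (a sketch in Python with NumPy; the number of points, dimension, and bandwidth are arbitrary assumptions):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))             # 20 distinct points in a compact subset of R^3
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * 0.5**2))                  # Gaussian RBF kernel matrix with sigma = 0.5
print(np.linalg.eigvalsh(K).min())                    # smallest eigenvalue is strictly positive
</syntaxhighlight>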
 
=== Parameter selection for conditional distribution kernel embeddings ===
 
=== Distribution regression ===
In distribution regression, the goal is to regress from probability distributions to reals (or vectors). Many important [[machine learning]] and statistical tasks fit into this framework, including [[Multiple-instance learning|multi-instance learning]] and [[point estimation]] problems without an analytical solution (such as [[Hyperparameter (Bayesian statistics)|hyperparameter]] or [[entropy estimation]]). In practice only samples from the distributions are observable, so the estimates have to rely on similarities computed between ''sets of points''. Distribution regression has been successfully applied, for example, to supervised entropy learning and to aerosol prediction from multispectral satellite images.<ref name = "MERR">Z. Szabó, B. Sriperumbudur, B. Póczos, A. Gretton. [http://jmlr.org/papers/v17/14-510.html Learning Theory for Distribution Regression]. ''Journal of Machine Learning Research'', 17(152):1–40, 2016.</ref>
 
Given training data <math>\left(\{X_{i,n}\}_{n=1}^{N_i}, y_i\right)_{i=1}^\ell</math>, where the bag <math>\hat{X}_i := \{X_{i,n}\}_{n=1}^{N_i}</math> contains samples from a probability distribution <math>X_i</math> and the <math>i^\text{th}</math> output label is <math>y_i\in \R</math>, one can tackle the distribution regression task by taking the embeddings of the distributions and learning the regressor from the embeddings to the outputs. In other words, one can consider the following kernel [[Tikhonov regularization|ridge regression]] problem <math>(\lambda>0)</math>
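A minimal numerical sketch of this two-stage approach (embed each bag by its empirical mean embedding, then run kernel ridge regression on the embeddings) is given below. It is written in Python with NumPy; the Gaussian kernel on the samples, the linear kernel between embeddings, the toy data, the ridge parameter, and all variable names are illustrative assumptions rather than a definitive implementation of the cited method:

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d / (2 * sigma**2))

def bag_gram(bags, sigma=1.0):
    """Gram matrix of inner products between empirical mean embeddings of the bags."""
    m = len(bags)
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = rbf_kernel(bags[i], bags[j], sigma).mean()
    return G

# toy data: bag i holds 50 draws from N(mu_i, 1); the label y_i is mu_i itself
rng = np.random.default_rng(0)
mus = rng.uniform(-2.0, 2.0, size=30)
bags = [rng.normal(mu, 1.0, size=(50, 1)) for mu in mus]
y = mus

lam = 1e-3                                               # ridge parameter lambda > 0
G = bag_gram(bags)
alpha = np.linalg.solve(G + lam * len(bags) * np.eye(len(bags)), y)

# prediction for a new bag: rough estimate of its underlying mean (here near 1.5)
test_bag = rng.normal(1.5, 1.0, size=(50, 1))
k_test = np.array([rbf_kernel(test_bag, b).mean() for b in bags])
print(k_test @ alpha)
</syntaxhighlight>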
:<math>\mathcal{C}_{XY} = \mathbb{E} [\mathbf{e}_X \otimes \mathbf{e}_Y] = ( P(X=s, Y=t))_{s,t \in \{1,\ldots,K\}} </math>
 
When <math> P(X=s)>0 </math> for all <math> s \in \{1,\ldots,K\} </math>, the conditional distribution embedding operator,
 
:<math>\mathcal{C}_{Y\mid X} = \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1},</math>
In this discrete-valued setting with the Kronecker delta kernel, the [[#Rules of probability as operations in the RKHS|kernel sum rule]] becomes
 
:<math>\underbrace{\begin{pmatrix} Q(X=1) \\ \vdots \\ Q(X = K) \\ \end{pmatrix}}_{\mu_X^\pi} = \underbrace{\begin{pmatrix} \\ P(X=s \mid Y=t) \\ \\ \end{pmatrix}}_{\mathcal{C}_{X\mid Y}} \underbrace{\begin{pmatrix} \pi(Y=1) \\ \vdots \\ \pi(Y = K) \\ \end{pmatrix}}_{ \mu_Y^\pi}</math>
 
The [[#Rules of probability as operations in the RKHS|kernel chain rule]] in this case is given by
 
:<math>\underbrace{\begin{pmatrix} \\ Q(X=s,Y=t) \\ \\ \end{pmatrix} }_{\mathcal{C}_{XY}^\pi} = \underbrace{\begin{pmatrix} \\ P(X=s \mid Y=t) \\ \\ \end{pmatrix} }_{\mathcal{C}_{X \mid Y}} \underbrace{ \begin{pmatrix} \pi(Y=1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \pi(Y=K) \\ \end{pmatrix} }_{\mathcal{C}_{YY}^\pi} </math>
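
In this discrete setting the operators above are ordinary probability tables, and the kernel sum and chain rules reduce to matrix algebra. A minimal sketch (Python with NumPy; the table size <math>K=3</math>, the random joint table, and the prior <math>\pi</math> are arbitrary assumptions):

<syntaxhighlight lang="python">
import numpy as np

K = 3
rng = np.random.default_rng(0)

# joint probability table P(X=s, Y=t), rows indexed by s, columns by t
P_XY = rng.random((K, K))
P_XY /= P_XY.sum()

P_Y = P_XY.sum(axis=0)            # marginal P(Y=t)
C_X_given_Y = P_XY / P_Y          # conditional embedding operator: column t holds P(X=s | Y=t)

# a new prior pi(Y), different from P(Y)
pi_Y = np.array([0.5, 0.3, 0.2])

# kernel sum rule: mu_X^pi = C_{X|Y} mu_Y^pi gives the marginal Q(X=s) under the prior pi
mu_X_pi = C_X_given_Y @ pi_Y
print(mu_X_pi, mu_X_pi.sum())     # a probability vector summing to 1

# kernel chain rule: C_{XY}^pi = C_{X|Y} diag(pi) gives the joint Q(X=s, Y=t) under the prior pi
C_XY_pi = C_X_given_Y @ np.diag(pi_Y)
print(np.allclose(C_XY_pi.sum(axis=0), pi_Y))   # marginalizing over X recovers pi(Y)
</syntaxhighlight>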