Kernel embedding of distributions: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 19:03, 27 March 2020 edit Michael Hardy (talk \| contribs) Administrators 210,597 edits No edit summary ← Previous edit		Latest revision as of 06:16, 22 May 2025 edit undo InternetArchiveBot (talk \| contribs) Bots, Pending changes reviewers 5,698,963 edits Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.9.5
(22 intermediate revisions by 14 users not shown)
Line 1: {{Short description\|Class of nonparametric methods}} {{cleanup\|reason=This nonsense of calling a distribution ''P''(''X''), with a capital ''X'', when capital ''X'' is also the name of the random variable, and other like things, need to get cleaned up.\|date=March 2020}} In [[machine learning]], the '''kernel embedding of distributions''' (also called the '''kernel mean''' or '''mean map''') comprises a class of [[nonparametric]] methods in which a [[probability distribution]] is represented as an element of a [[reproducing kernel Hilbert space]] (RKHS).<ref name = "Smola2007">A. Smola, A. Gretton, L. Song, B. Schölkopf. (2007). [http://eprints.pascal-network.org/archive/00003987/01/SmoGreSonSch07.pdf A Hilbert Space Embedding for Distributions] {{Webarchive\|url=https://web.archive.org/web/20131215111545/http://eprints.pascal-network.org/archive/00003987/01/SmoGreSonSch07.pdf \|date=2013-12-15 }}. ''Algorithmic Learning Theory: 18th International Conference''. Springer: 13–31.</ref> A generalization of the individual data-point feature mapping done in classical [[kernel methods]], the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as [[inner product]]s, distances, [[projection (linear algebra)\|projections]], [[linear transformation]]s, and [[spectral theory\|spectral analysis]].<ref name = "Song2013">L. Song, K. Fukumizu, F. Dinuzzo, A. Gretton (2013). [http://www.gatsby.ucl.ac.uk/~gretton/papers/SonFukGre13.pdf Kernel Embeddings of Conditional Distributions: A unified kernel framework for nonparametric inference in graphical models]. ''IEEE Signal Processing Magazine'' '''30''': 98–111.</ref> This [[machine learning\|learning]] framework is very general and can be applied to distributions over any space <math>\Omega </math> on which a sensible [[kernel function]] (measuring similarity between elements of <math>\Omega </math>) may be defined. For example, various kernels have been proposed for learning from data which are: [[Vector (mathematics and physics)\|vectors]] in <math>\mathbb{R}^d</math>, discrete classes/categories, [[string (computer science)\|string]]s, [[Graph (discrete mathematics)\|graph]]s/[[network theory\|networks]], images, [[time series]], [[manifold]]s, [[dynamical systems]], and other structured objects.<ref>J. Shawe-Taylor, N. Christianini. (2004). ''Kernel Methods for Pattern Analysis''. Cambridge University Press, Cambridge, UK.</ref><ref>T. Hofmann, B. Schölkopf, A. Smola. (2008). [http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1211819561 Kernel Methods in Machine Learning]. ''The Annals of Statistics'' '''36'''(3):1171–1220.</ref> The theory behind kernel embeddings of distributions has been primarily developed by [http://alex.smola.org/ Alex Smola], [http://www.cc.gatech.edu/~lsong/ Le Song ] {{Webarchive\|url=https://web.archive.org/web/20210412002513/https://www.cc.gatech.edu/~lsong/ \|date=2021-04-12 }}, [http://www.gatsby.ucl.ac.uk/~gretton/ Arthur Gretton], and [[Bernhard Schölkopf]]. A review of recent works on kernel embedding of distributions can be found in.<ref>{{Cite journal\|last=Muandet\|first=Krikamol\|last2=Fukumizu\|first2=Kenji\|last3=Sriperumbudur\|first3=Bharath\|last4=Schölkopf\|first4=Bernhard\|date=2017-06-28\|title=Kernel Mean Embedding of Distributions: A Review and Beyond\|journal=Foundations and Trends in Machine Learning\|language=English\|volume=10\|issue=1–2\|pages=1–141\|doi=10.1561/2200000060\|issn=1935-8237\|arxiv=1605.09522}}</ref> The analysis of distributions is fundamental in [[machine learning]] and [[statistics]], and many algorithms in these fields rely on information theoretic approaches such as [[entropy]], [[mutual information]], or [[Kullback–Leibler divergence]]. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data.<ref name = "SongThesis">L. Song. (2008) [~~http~~https://www.ccinfosys.~~gatech~~usyd.edu.au/~~~lsong/papers~~research/~~lesong_thesis~~2008_Song_Le_thesis.pdf Learning via Hilbert Space Embedding of Distributions]. PhD Thesis, University of Sydney.</ref> Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. [[Mixture model#Gaussian mixture model\|Gaussian mixture models]]), while nonparametric methods like [[kernel density estimation]] (Note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or [[characteristic function (probability theory)\|characteristic function]] representation (via the [[Fourier transform]] of the distribution) break down in high-dimensional settings.<ref name = "Song2013" /> Methods based on the kernel embedding of distributions sidestep these problems and also possess the following advantages:<ref name = "SongThesis" /> Line 16 ⟶ 17: ==Definitions== Let <math>X</math> denote a random variable with ___domain <math>\Omega</math> and distribution <math>P.</math>. Given a symmetric, [[positive-definite kernel]] <math>k~~</math> on <math>~~:\Omega \times \Omega, \rightarrow \mathbb{R}</math> the [[Reproducing kernel Hilbert space#Moore–Aronszajn theorem\|Moore–Aronszajn theorem]] asserts the existence of a unique RKHS <math>\mathcal{H}</math> on <math>\Omega</math> (a [[Hilbert space]] of functions <math>f: \Omega \to \R</math> equipped with an inner ~~products~~product <math>\langle \cdot, \cdot \rangle_\mathcal{H}</math> and ~~norms~~a norm <math>\\| \cdot \\|_\mathcal{H}</math>) for which <math>k</math> is a reproducing kernel, i.e., in which the element <math>k(x,\cdot)</math> satisfies the reproducing property :<math>\~~forall~~langle f, k(x,\incdot) \rangle_\mathcal{H}, ~~\forall~~= f(x ~~\in \Omega~~) \qquad \~~langle~~forall f, ~~k(x,~~\~~cdot)~~in ~~\rangle_~~\mathcal{H},\quad =\forall f(x) \in \Omega.</math> One may alternatively consider <math>x \mapsto k(x,\cdot)</math> as an implicit feature mapping <math>\varphi~~(x)</math> from <math>~~:\Omega~~</math> to <math>~~\rightarrow \mathcal{H} </math> (which is therefore also called the feature space), so that <math>k(x, x') = \langle \varphi(x), \varphi(x')\rangle_\mathcal{H}</math> can be viewed as a measure of similarity between points <math>x, x' \in \Omega.</math> While the [[similarity measure]] is linear in the feature space, it may be highly nonlinear in the original space depending on the choice of kernel. ===Kernel embedding=== Line 35 ⟶ 36: ===Joint distribution embedding=== If <math>Y</math> denotes another random variable (for simplicity, assume the co-___domain of <math>Y</math> is also <math>\Omega</math> with the same kernel <math>k</math> which satisfies <math> \langle \varphi(x) \otimes \varphi(y), \varphi(x') \otimes \varphi(y') \rangle = k(x,x') ~~\otimes~~ k(y,y')</math>), then the [[Joint probability distribution\|joint distribution]] <math> P(x,y)) </math> can be mapped into a [[tensor product]] feature space <math>\mathcal{H} \otimes \mathcal{H} </math> via <ref name = "Song2013"/> :<math> \mathcal{C}_{XY} = \mathbb{E~~}_{XY~~} [\varphi(X) \otimes \varphi(Y)] = \int_{\Omega \times \Omega} \varphi(x) \otimes \varphi(y) \ \mathrm{d} P(x,y) </math> By the equivalence between a [[tensor]] and a [[linear map]], this joint embedding may be interpreted as an uncentered [[cross-covariance]] operator <math>\mathcal{C}_{XY}: \mathcal{H} \to \mathcal{H}</math> from which the cross-covariance of ~~mean-zero~~ functions <math>f,g \in \mathcal{H}</math> can be computed as <ref name = "SongCDE">L. Song, J. Huang, A. J. Smola, K. Fukumizu. (2009).[http://www.stanford.edu/~jhuang11/research/pubs/icml09/icml09.pdf Hilbert space embeddings of conditional distributions]. ''Proc. Int. Conf. Machine Learning''. Montreal, Canada: 961–968.</ref> :<math>\operatorname{Cov} (f(X), g(Y)) := \mathbb{E}_ [f(X) g(Y)] - \mathbb{XYE} [f(X)] \mathbb{E} [g(Y)] = \langle f , \mathcal{C}_{XY} g \rangle_{\mathcal{H}} = \langle f \otimes g , \mathcal{C}_{XY} \rangle_{\mathcal{H} \otimes \mathcal{H}}</math> Given <math>n</math> pairs of training examples <math>\{(x_1, y_1), \dots, (x_n, y_n)\} </math> drawn i.i.d. from <math>P</math>, we can also empirically estimate the joint distribution kernel embedding via Line 50 ⟶ 51: Given a [[conditional distribution]] <math>P(y\mid x),</math> one can define the corresponding RKHS embedding as <ref name = "Song2013"/> :<math>\mu_{Y \mid x} = \mathbb{E~~}_{Y \mid x~~} [ \varphi(Y)\mid X ] = \int_\Omega \varphi(y) \ \mathrm{d}P(y \mid x) </math> Note that the embedding of <math>P(y\mid x) </math> thus defines a family of points in the RKHS indexed by the values <math>x</math> taken by conditioning variable <math>X</math>. By fixing <math>X</math> to a particular value, we obtain a single element in <math>\mathcal{H}</math>, and thus it is natural to define the operator :<math>\begin{cases} \mathcal{C}_{Y\mid X}: \mathcal{H} \to \mathcal{H} \\ \mathcal{C}_{Y\mid X} = \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} \end{cases}</math> which given the feature mapping of <math>x</math> outputs the conditional embedding of <math>Y</math> given <math>X = x.</math> Assuming that for all <math>g \in \mathcal{H}: \mathbb{E~~}_{Y \mid X~~} [g(Y)\mid X] \in \mathcal{H},</math> it can be shown that <ref name = "SongCDE" /> :<math> \mu_{Y \mid x} = \mathcal{C}_{Y \mid X} \varphi(x)</math> Line 66 ⟶ 67: :<math>\widehat{C}_{Y\mid X} = \boldsymbol{\Phi} (\mathbf{K} + \lambda \mathbf{I})^{-1} \boldsymbol{\Upsilon}^T</math> where <math>\boldsymbol{\Phi} = \left(\varphi(~~y_i~~y_1),\dots, \varphi(y_n)\right), \boldsymbol{\Upsilon} = \left(\varphi(~~x_i~~x_1),\dots, \varphi(x_n)\right) </math> are implicitly formed feature matrices, <math>\mathbf{K} =\boldsymbol{\Upsilon}^T \boldsymbol{\Upsilon} </math> is the Gram matrix for samples of <math>X</math>, and <math>\lambda</math> is a [[Regularization (mathematics)\|regularization]] parameter needed to avoid [[overfitting]]. Thus, the empirical estimate of the kernel conditional embedding is given by a weighted sum of samples of <math>Y</math> in the feature space: :<math> \widehat{\mu}_{Y\mid x} = \sum_{i=1}^n \beta_i (x) \varphi(y_i) = \boldsymbol{\Phi} \boldsymbol{\beta}(x) </math> where <math> \boldsymbol{\beta}(x) = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{K}_x</math> and <math> \mathbf{K}_x = \left( k(x_1, x), \dots, k(x_n, x) \right)^T </math> Line 77 ⟶ 78: * The expectation of any function <math> f </math> in the RKHS can be computed as an inner product with the kernel embedding: ::<math> \mathbb{E}_X [f(X)] = \langle f, \mu_X \rangle_\mathcal{H} </math> * In the presence of large sample sizes, manipulations of the <math>n \times n</math> Gram matrix may be computationally demanding. Through use of a low-rank approximation of the Gram matrix (such as the [[incomplete Cholesky factorization]]), running time and memory requirements of kernel-embedding-based learning algorithms can be drastically reduced without suffering much loss in approximation accuracy.<ref name = "Song2013"/> Line 84 ⟶ 85: * If <math>k</math> is defined such that <math>f</math> takes values in <math>[0,1]</math> for all <math>f \in \mathcal{H}</math> with <math>\\| f\\|_\mathcal{H} \le 1 </math> (as is the case for the widely used [[radial basis function]] kernels), then with probability at least <math>1-\delta </math>:<ref name="SongThesis" /> ::<math>\\|\mu_X - \widehat{\mu}_X \\|_\mathcal{H} = \sup_{f \in \mathcal{B}(0,1)} \left\| \mathbb{E}_X [f(X)] - \frac{1}{n} \sum_{i=1}^n f(x_i) \right\| \le \frac{2}{n} \mathbb{E}_X \left[ \sqrt{\operatorname{tr} K} \right] + \sqrt{\frac{\log (2/\delta)}{2n}}</math> :where <math>\mathcal{B}(0,1)</math> denotes the unit ball in <math>\mathcal{H}</math> and <math>\mathbf{K} =(k_{ij})</math> is the Gram matrix with <math>k_{ij} = k(x_i, x_j).</math> * The rate of convergence (in RKHS norm) of the empirical kernel embedding to its distribution counterpart is <math>O(n^{-1/2})</math> and does ''not'' depend on the dimension of <math>X</math>. ▼ ▲* The rate of convergence (in RKHS norm) of the empirical kernel embedding to its distribution counterpart is <math>O(n^{-1/2})</math> and does ''not'' depend on the dimension of <math>X</math>. * Statistics based on kernel embeddings thus avoid the [[curse of dimensionality]], and though the true underlying distribution is unknown in practice, one can (with high probability) obtain an approximation within <math>O(n^{-1/2})</math> of the true kernel embedding based on a finite sample of size <math>n</math>. * For the embedding of conditional distributions, the empirical estimate can be seen as a ''weighted'' average of feature mappings (where the weights <math>\beta_i(x) </math> depend on the value of the conditioning variable and capture the effect of the conditioning on the kernel embedding). In this case, the empirical estimate converges to the conditional distribution RKHS embedding with rate <math>O\left(n^{-1/4} \right)</math> if the regularization parameter <math>\lambda</math> is decreased as <math>O\left( n^{-1/2} \right),</math> though faster rates of convergence may be achieved by placing additional assumptions on the joint distribution.<ref name="Song2013"/> === Universal kernels === * Let <math>\mathcal{X} \subseteq \mathbb{R}^{b}</math> be a compact metric space and <math>C(\mathcal{X})</math> the set of [[Function space#Functional analysis\|continuous functions]]. The reproducing kernel <math>k:\mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R} </math> is called '''universal''' if and only if the RKHS <math>\mathcal{H}</math> of <math>k</math> is [[Dense set\|dense]] in <math>C(\mathcal{X})</math>, i.e., for any <math>g \in C(\mathcal{X})</math> and all <math>\varepsilon > 0</math> there exists an <math> f \in \mathcal{H}</math> such that <math> \\| f-g\\|_{\infty} \leq \varepsilon</math>.<ref>{{cite book \|last1=Steinwart \|first1=Ingo \|last2=Christmann \|first2=Andreas \|title=Support Vector Machines \|publisher=Springer \|___location=New York \|year=2008 \|isbn=978-0-387-77241-7 }}</ref> All universal kernels defined on a compact space are characteristic kernels but the converse is not always true.<ref>{{Cite journal\|last1=Sriperumbudur\|first1= B. K.\|last2=Fukumizu\|first2=K.\|last3=Lanckriet\|first3=G.R.G.\|title=Universality, Characteristic Kernels and RKHS Embedding of Measures\|year= 2011 \|journal=Journal of Machine Learning Research\|volume=12\|number=70}}</ref> ~~Letting~~Let <math>~~C(\mathcal{X})~~k</math> ~~denote~~be ~~the~~a ~~space of [[Continuous function\|~~continuous]] [[~~Bounded~~Translational ~~function~~symmetry\|~~bounded]]~~translation ~~functions on [[Compact space\|compact~~invariant]] ~~___domain <math>\mathcal{X}</math>, we call a~~ kernel <math>k~~</math>~~(x, x'~~'universal''~~) if= ~~<math>k~~h(x~~,\cdot~~-x')</math> ~~is continuous for all~~with <math>x \in \mathbb{R}^{b}</math>. ~~and~~Then [[Bochner's theorem]] guarantees the ~~RKHS~~existence ~~induced~~of bya unique finite Borel measure <math>k\mu</math> is(called the [[~~Dense~~Spectral ~~set~~theory of ordinary differential equations#Spectral measure\|~~dense~~spectral measure]]) inon <math>C(\~~mathcal~~mathbb{XR}^{b})</math>. such that ::<math>h(t) = \~~int~~int_{\mathbb{R}^{b}} e^{-i\langle t, \omega \rangle} d\mu(d\omega), \quad \forall t \in \mathbb{R}^{b}.</math>▼ :For <math>k</math> to be universal it suffices that the continuous part of <math>\mu</math> in its unique [[Lebesgue's decomposition theorem\|Lebesgue decomposition]] <math>\mu = \mu_c + \mu_s</math> is non-zero. Furthermore, if ::<math>d\mu_c(\omega) = s(\omega)d\omega,</math> :then <math>s</math> is the [[spectral density]] of frequencies <math>\omega</math> in <math>\mathbb{R}^{b}</math> and <math>h</math> is the [[Fourier transform]] of <math>s</math>. If the [[Support (mathematics)\|support]] of <math>\mu</math> is all of <math>\mathbb{R}^{b}</math>, then <math>k</math> is a characteristic kernel as well.<ref>{{Citation \|last=Liang\|first=Percy\| year=2016\|title=CS229T/STAT231: Statistical Learning Theory \| series = Stanford lecture notes \| url=https://web.stanford.edu/class/cs229t/notes.pdf}}</ref><ref>{{cite conference\|last1=Sriperumbudur\|first1= B. K.\|last2=Fukumizu\|first2=K.\|last3=Lanckriet\|first3=G.R.G.\|title=On the relation between universality, characteristic kernels and RKHS embedding of measures\|url= https://proceedings.mlr.press/v9/sriperumbudur10a.html\|year=2010 \| conference=Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics\|___location=Italy}} </ref><ref>{{cite journal \| last1 = Micchelli\|first1=C.A.\| last2 = Xu\|first2=Y.\| last3 = Zhang\|first3=H.\| title = Universal Kernels \| journal = Journal of Machine Learning Research \| year = 2006 \| volume = 7 \| number = 95\| pages = 2651–2667\| url= http://jmlr.org/papers/v7/micchelli06a.html }}</ref> * If <math>k</math> induces a strictly positive definite kernel matrix for any set of distinct points, then it is a universal kernel.<ref name = "SongThesis" /> For example, the widely used Gaussian RBF kernel ::<math> k(x,x') = \exp\left(-\frac{1}{2\sigma^2} \\|x-x'\\|^2 \right)</math> :on compact subsets of <math>\mathbb{R}^db </math> is universal. * If <math>k</math> is shift-invariant <math>h(x-y)=k(x, y)</math> and its representation in Fourier ___domain is ▲::<math>h(t) = \int e^{-i\langle t, \omega \rangle} \mu(d\omega)</math> :and [[Support (mathematics)\|support]] of <math>\mu</math> is an entire space, then <math>k</math> is universal.<ref>[https://web.stanford.edu/class/cs229t/notes.pdf] page 139</ref> For example, Gaussian RBF is universal, [[sinc]] kernel is not universal. * If <math> k </math> is universal, then it is ''characteristic'', i.e. the kernel embedding is one-to-one.<ref>A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola. (2007). [http://www.gatsby.ucl.ac.uk/~gretton/papers/GreBorRasSchSmo07.pdf A kernel method for the two-sample-problem]. ''Advances in Neural Information Processing Systems'' '''19''', MIT Press, Cambridge, MA.</ref> === Parameter selection for conditional distribution kernel embeddings === Line 111 ⟶ 109: * The empirical kernel conditional distribution embedding operator <math>\widehat{\mathcal{C}}_{Y\|X}</math> can alternatively be viewed as the solution of the following regularized least squares (function-valued) regression problem <ref>S. Grunewalder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, M. Pontil. (2012). [http://icml.cc/2012/papers/898.pdf Conditional mean embeddings as regressors]. ''Proc. Int. Conf. Machine Learning'': 1823–1830.</ref> ::<math>\min_{\mathcal{C}: \mathcal{H} \to \mathcal{H}} \sum_{i=1}^n \left \\|\varphi(y_i)-\mathcal{C} \varphi(x_i) \right \\|_\mathcal{H}^2 + \lambda \\|\mathcal{C} \\|_{HS}^2</math> :where <math>\\|\cdot\\|_{HS}</math> is the [[Hilbert–Schmidt operator\|Hilbert–Schmidt norm]]. * One can thus select the regularization parameter <math>\lambda</math> by performing [[cross-validation (statistics)\|cross-validation]] based on the squared loss function of the regression problem. Line 117 ⟶ 115: == Rules of probability as operations in the RKHS == This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al.<ref name = "Song2013" /><ref name = "SongCDE" /> The following notation is adopted: * <math>P(X,Y)= </math> joint distribution over random variables <math> X, Y </math> * <math>P(X)= \int_\Omega P(X, \mathrm{d}y) = </math> marginal distribution of <math>X</math>; <math>P(Y)= </math> marginal distribution of <math>Y </math> * <math> P(Y \mid X) = \frac{P(X,Y)}{P(X)} = </math> conditional distribution of <math> Y </math> given <math> X </math> with corresponding conditional embedding operator <math> \mathcal{C}_{Y \mid X}</math> * <math> \pi(Y) = </math> prior distribution over <math> Y </math> * <math> Q </math> is used to distinguish distributions which incorporate the prior from distributions <math> P </math> which do not rely on the prior Line 138 ⟶ 133: The analog of this rule in the kernel embedding framework states that <math>\mu_X^\pi,</math> the RKHS embedding of <math>Q(X)</math>, can be computed via :<math>\mu_X^\pi = \mathbb{E~~}_{Y~~} [\mathcal{C}_{X \mid Y} \varphi(Y) ] = \mathcal{C}_{X\mid Y} \mathbb{E~~}_{Y~~} [\varphi(Y)] = \mathcal{C}_{X\mid Y} \mu_Y^\pi </math> where <math>\mu_Y^\pi</math> is the kernel embedding of <math>\pi(Y).</math> In practical implementations, the kernel sum rule takes the following form :<math> \widehat{\mu}_X^\pi = \widehat{\mathcal{C}}_{X \mid Y} \widehat{\mu}_Y^\pi = \boldsymbol{\Upsilon} (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha} </math> where :<math>\mu_Y^\pi = \sum_{i=1}^{\widetilde{n}} \alpha_i \varphi(\widetilde{y}_i)</math> is the empirical kernel embedding of the prior distribution, <math>\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{\widetilde{n}} )^T,</math> <math>\boldsymbol{\Upsilon} = \left(\varphi(x_1), \ldots, \varphi(x_n) \right) </math>, and <math>\mathbf{G}, \widetilde{\mathbf{G}} </math> are Gram matrices with entries <math>\mathbf{G}_{ij} = k(y_i, y_j), \widetilde{\mathbf{G}}_{ij} = k(y_i, \widetilde{y}_j) </math> respectively. === Kernel chain rule === In probability theory, a joint distribution can be factorized into a product between conditional and marginal distributions :<math>Q(X,Y) = P(X \mid Y) \pi(Y) </math> Line 157 ⟶ 152: The analog of this rule in the kernel embedding framework states that <math> \mathcal{C}_{XY}^\pi,</math> the joint embedding of <math>Q(X,Y),</math> can be factorized as a composition of conditional embedding operator with the auto-covariance operator associated with <math>\pi(Y)</math> :<math>\mathcal{C}_{XY}^\pi = \mathcal{C}_{X \mid Y} \mathcal{C}_{YY}^\pi </math> where :<math>\mathcal{C}_{XY}^\pi = \mathbb{E~~}_{XY~~} [\varphi(X) \otimes \varphi(Y) ],</math> :<math>\mathcal{C}_{YY}^\pi = \mathbb{E}_Y [\varphi(Y) \otimes \varphi(Y)].</math> In practical implementations, the kernel chain rule takes the following form Line 169 ⟶ 164: === Kernel Bayes' rule === In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as :<math>Q(Y\mid x) = \frac{P(x\mid Y) \pi(Y)}{Q(x)} </math> where <math> Q(x) = \int_\Omega P(x \mid y) \, \mathrm{d} \pi(y) </math> Line 175 ⟶ 170: The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution :<math> \mu_{Y\mid x}^\pi = \mathcal{C}_{Y \mid X}^\pi \varphi(x) = \mathcal{C}_{YX}^\pi \left ( \mathcal{C}_{XX}^\pi \right )^{-1} \varphi(x)</math> where from the chain rule: :<math> \mathcal{C}_{YX}^\pi = \left( \mathcal{C}_{X\mid Y} \mathcal{C}_{YY}^\pi \right)^T.</math> Line 183 ⟶ 178: In practical implementations, the kernel Bayes' rule takes the following form :<math>\widehat{\mu}_{Y\mid x}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( \left (\widehat{\mathcal{C}}_{XX} \right )^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi \varphi(x) = \widetilde{\boldsymbol{\Phi}} \boldsymbol{\Lambda}^T \left( (\mathbf{D} \mathbf{K})^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \mathbf{K} \mathbf{D} \mathbf{K}_x </math> where :<math>\boldsymbol{\Lambda} = \left(\mathbf{G} + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widetilde{\mathbf{G}} \operatorname{diag}(\boldsymbol{\alpha}), \qquad \mathbf{D} = \operatorname{diag}\left(\left(\mathbf{G} + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha} \right).</math> Two regularization parameters are used in this framework: <math>\lambda </math> for the estimation of <math> \widehat{\mathcal{C}}_{YX}^\pi, \widehat{\mathcal{C}}_{XX}^\pi = \boldsymbol{\Upsilon} \mathbf{D} \boldsymbol{\Upsilon}^T</math> and <math>\widetilde{\lambda}</math> for the estimation of the final conditional embedding operator :<math>\widehat{\mathcal{C}}_{Y\mid X}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( \left (\widehat{\mathcal{C}}_{XX}^\pi \right )^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi.</math> Line 198 ⟶ 193: === Measuring distance between distributions === The '''maximum mean discrepancy (MMD)''' is a distance-measure between distributions <math>P(X)</math> and <math>Q(Y)</math> which is defined as the ~~squared~~ distance between their embeddings in the RKHS <ref name = "SongThesis" /> :<math>\text{MMD}(P,Q) = \left \\| \mu_X - \mu_Y \right \\|_{\mathcal{H}}^2 .</math> While most distance-measures between distributions such as the widely used [[Kullback–Leibler divergence]] either require density estimation (either parametrically or nonparametrically) or space partitioning/bias correction strategies,<ref name = "SongThesis" /> the MMD is easily estimated as an empirical mean which is concentrated around the true value of the MMD. The characterization of this distance as the ''maximum mean discrepancy'' refers to the fact that computing the MMD is equivalent to finding the RKHS function that maximizes the difference in expectations between the two probability distributions :<math>\text{MMD}(P,Q) = \sup_{\\|f \\|_\mathcal{H} \le 1} \left( \mathbb{E}_X [f(X)] - \mathbb{E}_Y [f(Y)] \right),</math>▼ a form of [[integral probability metric]]. ▲:<math>\text{MMD}(P,Q) = \sup_{\\|f \\|_\mathcal{H} \le 1} \left( \mathbb{E}_X[f(X)] - \mathbb{E}_Y[f(Y)] \right)</math> === Kernel two-sample test === Line 226 ⟶ 221: === Measuring dependence of random variables === A measure of the statistical dependence between random variables <math>X</math> and <math>Y</math> (from any domains on which sensible kernels can be defined) can be formulated based on the Hilbert–Schmidt Independence Criterion <ref>A. Gretton, O. Bousquet, A. Smola, B. Schölkopf. (2005). [http://www.gatsby.ucl.ac.uk/~gretton/papers/GreBouSmoSch05.pdf Measuring statistical dependence with Hilbert–Schmidt norms]. ''Proc. Intl. Conf. on Algorithmic Learning Theory'': 63–78.</ref> :<math> \text{HSIC}(X, Y) = \left \\| \mathcal{C}_{XY} - \mu_X \otimes \mu_Y \right \\|_{\mathcal{H} \otimes \mathcal{H}}^2 </math> and can be used as a principled replacement for [[mutual information]], [[Pearson correlation]] or any other dependence measure used in learning algorithms. Most notably, HSIC can detect arbitrary dependencies (when a characteristic kernel is used in the embeddings, HSIC is zero if and only if the variables are [[independence (probability theory)\|independent]]), and can be used to measure dependence between different types of data (e.g. images and text captions). Given ''n'' i.i.d. samples of each random variable, a simple parameter-free [[Bias of an estimator\|unbiased]] estimator of HSIC which exhibits [[Concentration of measure\|concentration]] about the true value can be computed in <math>O(n(d_f^2 +d_g^2))</math> time,<ref name = "SongThesis"/> where the Gram matrices of the two datasets are approximated using <math>\mathbf{A} \mathbf{A}^T, \mathbf{B} \mathbf{B}^T </math> with <math>\mathbf{A} \in \R^{n \times d_f}, \mathbf{B} \in \R^{n \times d_g}</math>. The desirable properties of HSIC have led to the formulation of numerous algorithms which utilize this dependence measure for a variety of common machine learning tasks such as: [[feature selection]] (BAHSIC <ref>L. Song, A. Smola , A. Gretton, K. Borgwardt, J. Bedo. (2007). [http://www.machinelearning.org/proceedings/icml2007/papers/244.pdf Supervised feature selection via dependence estimation]. ''Proc. Intl. Conf. Machine Learning'', Omnipress: 823–830.</ref>), [[Cluster analysis\|clustering]] (CLUHSIC <ref>L. Song, A. Smola, A. Gretton, K. Borgwardt. (2007). [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2007_SongSGB07.pdf A dependence maximization view of clustering]. ''Proc. Intl. Conf. Machine Learning''. Omnipress: 815–822.</ref>), and [[dimensionality reduction]] (MUHSIC <ref>L. Song, A. Smola, K. Borgwardt, A. Gretton. (2007). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2007_492.pdf Colored maximum variance unfolding]. ''Neural Information Processing Systems''.</ref>). HSIC can be extended to measure the dependence of multiple random variables. The question of when HSIC captures independence in this case has recently been studied:<ref name = "CharAndUniv">Zoltán Szabó, Bharath K. Sriperumbudur. [http://jmlr.org/papers/v18/17-492.html Characteristic and Universal Tensor Product Kernels]. ''Journal of Machine Learning Research'', 19:1–29, 2018.</ref> for Line 238 ⟶ 233: === Kernel belief propagation === [[Belief propagation]] is a fundamental algorithm for inference in [[graphical model]]s in which nodes repeatedly pass and receive messages corresponding to the evaluation of conditional expectations. In the kernel embedding framework, the messages may be represented as RKHS functions and the conditional distribution embeddings can be applied to efficiently compute message updates. Given ''n'' samples of random variables represented by nodes in a [[Markov random field]], the incoming message to node ''t'' from node ''u'' can be expressed as :<math>m_{ut}(\cdot) = \sum_{i=1}^n \beta_{ut}^i \varphi(x_t^i)</math> if it assumed to lie in the RKHS. The '''kernel belief propagation update''' message from ''t'' to node ''s'' is then given by <ref name = "Song2013"/> Line 248 ⟶ 243: where <math>\odot</math> denotes the element-wise vector product, <math>N(t) \backslash s </math> is the set of nodes connected to ''t'' excluding node ''s'', <math> \boldsymbol{\beta}_{ut} = \left(\beta_{ut}^1, \dots, \beta_{ut}^n \right) </math>, <math>\mathbf{K}_t, \mathbf{K}_s </math> are the Gram matrices of the samples from variables <math>X_t, X_s </math>, respectively, and <math>\boldsymbol{\Upsilon}_s = \left(\varphi(x_s^1),\dots, \varphi(x_s^n)\right)</math> is the feature matrix for the samples from <math>X_s</math>. Thus, if the incoming messages to node ''t'' are linear combinations of feature mapped samples from <math> X_t </math>, then the outgoing message from this node is also a linear combination of feature mapped samples from <math> X_s </math>. This RKHS function representation of message-passing updates therefore produces an efficient belief propagation algorithm in which the [[Markov ~~Random~~random ~~Field~~field#Clique factorization\|potentials]] are nonparametric functions inferred from the data so that arbitrary statistical relationships may be modeled.<ref name = "Song2013"/> === Nonparametric filtering in hidden Markov models === In the [[hidden Markov model]] (HMM), two key quantities of interest are the transition probabilities between hidden states <math> P(S^t \mid S^{t-1})</math> and the emission probabilities <math>P(O^t \mid S^t)</math> for observations. Using the kernel conditional distribution embedding framework, these quantities may be expressed in terms of samples from the HMM. A serious limitation of the embedding methods in this ___domain is the need for training samples containing hidden states, as otherwise inference with arbitrary distributions in the HMM is not possible. One common use of HMMs is [[Hidden Markov ~~Model~~model#Filtering\|filtering]] in which the goal is to estimate posterior distribution over the hidden state <math>s^t</math> at time step ''t'' given a history of previous observations <math>h^t = (o^1, \dots, o^t)</math> from the system. In filtering, a '''belief state''' <math>P(S^{t+1} \mid h^{t+1})</math> is recursively maintained via a prediction step (where updates <math>P(S^{t+1} \mid h^t) = \mathbb{E}~~_{S^t \mid h^t}~~ [P(S^{t+1} \mid S^t) \mid h^t]</math> are computed by marginalizing out the previous hidden state) followed by a conditioning step (where updates <math> P(S^{t+1} \mid h^t, o^{t+1}) \propto P(o^{t+1} \mid S^{t+1}) P(S^{t+1} \mid h^t) </math> are computed by applying Bayes' rule to condition on a new observation).<ref name = "Song2013"/> The RKHS embedding of the belief state at time ''t+1'' can be recursively expressed as :<math>\mu_{S^{t+1} \mid h^{t+1}} = \mathcal{C}_{S^{t+1} O^{t+1}}^\pi \left(\mathcal{C}_{O^{t+1} O^{t+1}}^\pi \right)^{-1} \varphi(o^{t+1}) </math> by computing the embeddings of the prediction step via the [[#Kernel ~~Sum~~sum ~~Rule~~rule\|kernel sum rule]] and the embedding of the conditioning step via [[#Kernel Bayes' ~~Rule~~rule\|kernel Bayes' rule]]. Assuming a training sample <math>(\widetilde{s}^1, \dots, \widetilde{s}^T, \widetilde{o}^1, \dots, \widetilde{o}^T) </math> is given, one can in practice estimate :<math>\widehat{\mu}_{S^{t+1} \mid h^{t+1}} = \sum_{i=1}^T \alpha_i^t \varphi(\widetilde{s}^t)</math> and filtering with kernel embeddings is thus implemented recursively using the following updates for the weights <math>\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_T)</math> <ref name = "Song2013"/> Line 271 ⟶ 266: The '''support measure machine''' (SMM) is a generalization of the [[support vector machine]] (SVM) in which the training examples are probability distributions paired with labels <math> \{P_i, y_i\}_{i=1}^n, \ y_i \in \{+1, -1\} </math>.<ref name = "SMM">K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf. (2012). [http://books.nips.cc/papers/files/nips25/NIPS2012_0015.pdf Learning from Distributions via Support Measure Machines]. ''Advances in Neural Information Processing Systems'': 10–18.</ref> SMMs solve the standard SVM [[Support vector machine#Dual form\|dual optimization problem]] using the following '''expected kernel''' :<math> K\left(P(X), Q(Z)\right) = \langle \mu_X , \mu_Z \rangle_\mathcal{H} = \mathbb{E~~}_{XZ~~} [k(x,z)] </math> which is computable in closed form for many common specific distributions <math> P_i </math> (such as the Gaussian distribution) combined with popular embedding kernels <math>k</math> (e.g. the Gaussian kernel or polynomial kernel), or can be accurately empirically estimated from i.i.d. samples <math>\{x_i\}_{i=1}^n \sim P(X), \{z_j\}_{j=1}^m \sim Q(Z) </math> via Line 287 ⟶ 282: By utilizing the kernel embedding of marginal and conditional distributions, practical approaches to deal with the presence of these types of differences between training and test domains can be formulated. Covariate shift may be accounted for by reweighting examples via estimates of the ratio <math>P^\text{te}(X)/P^\text{tr}(X)</math> obtained directly from the kernel embeddings of the marginal distributions of <math>X</math> in each ___domain without any need for explicit estimation of the distributions.<ref name = "CovS"/> Target shift, which cannot be similarly dealt with since no samples from <math>Y</math> are available in the test ___domain, is accounted for by weighting training examples using the vector <math>\boldsymbol{\beta}^(\mathbf{y}^\text{tr}) </math> which solves the following optimization problem (where in practice, empirical approximations must be used) <ref name = "DA"/> :<math>\min_{\boldsymbol{\beta}(y)} \left \\|\mathcal{C}_{{(X \mid Y)}^\text{tr}} \mathbb{E~~}_{Y^\text{tr}~~} [\boldsymbol{\beta}(y) \varphi(y^\text{tr})] - \mu_{X^\text{te}} \right \\|_\mathcal{H}^2</math> subject to <math>\boldsymbol{\beta}(y) \ge 0, \mathbb{E~~}_{Y^\text{tr}~~} [\boldsymbol{\beta}(y^\text{tr})] = 1</math> To deal with ___location scale conditional shift, one can perform a LS transformation of the training points to obtain new transformed training data <math> \mathbf{X}^\text{new} = \mathbf{X}^\text{tr} \odot \mathbf{W} + \mathbf{B}</math> (where <math>\odot</math> denotes the element-wise vector product). To ensure similar distributions between the new transformed training samples and the test data, <math>\mathbf{W},\mathbf{B}</math> are estimated by minimizing the following empirical kernel embedding distance <ref name = "DA"/> Line 298 ⟶ 293: Given ''N'' sets of training examples sampled i.i.d. from distributions <math>P^{(1)}(X,Y), P^{(2)}(X,Y), \ldots, P^{(N)}(X,Y)</math>, the goal of '''___domain generalization''' is to formulate learning algorithms which perform well on test examples sampled from a previously unseen ___domain <math>P^(X,Y)</math> where no data from the test ___domain is available at training time. If conditional distributions <math>P(Y \mid X)</math> are assumed to be relatively similar across all domains, then a learner capable of ___domain generalization must estimate a functional relationship between the variables which is robust to changes in the marginals <math>P(X)</math>. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions while preserving a common conditional distribution shared between all training domains.<ref name = "DICA">K. Muandet, D. Balduzzi, B. Schölkopf. (2013).[http://jmlr.org/proceedings/papers/v28/muandet13.pdf Domain Generalization Via Invariant Feature Representation]. ''30th International Conference on Machine Learning''.</ref> DICA thus extracts ''invariants'', features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as [[kernel principal component analysis]], transfer component analysis, and covariance operator inverse regression.<ref name = "DICA"/> Defining a probability distribution <math>\mathcal{P}</math> on the RKHS <math>\mathcal{H}</math> with :<math>\mathcal{P} \left (\mu_{X^{(i)}Y^{(i)}} \right ) = \frac{1}{N} \qquad \text{ for } i=1,\dots, N,</math> DICA measures dissimilarity between domains via '''distributional variance''' which is computed as :<math>V_\mathcal{H} (\mathcal{P}) = \frac{1}{N} \operatorname{tr}(\mathbf{G}) - \frac{1}{N^2} \sum_{i,j=1}^N \mathbf{G}_{ij} </math> where :<math>\mathbf{G}_{ij} = \left \langle \mu_{X^{(i)}}, \mu_{X^{(j)}} \right \rangle_\mathcal{H} </math> so <math>\mathbf{G}</math> is a <math>N \times N</math> Gram matrix over the distributions from which the training data are sampled. Finding an [[Orthogonal matrix\|orthogonal transform]] onto a low-dimensional [[Linear subspace\|subspace]] ''B'' (in the feature space) which minimizes the distributional variance, DICA simultaneously ensures that ''B'' aligns with the [[Basis function\|bases]] of a '''central subspace''' ''C'' for which <math>Y</math> becomes independent of <math>X</math> given <math>C^T X</math> across all domains. In the absence of target values <math>Y</math>, an unsupervised version of DICA may be formulated which finds a low-dimensional subspace that minimizes distributional variance while simultaneously maximizing the variance of <math>X</math> (in the feature space) across all domains (rather than preserving a central subspace).<ref name = "DICA"/> === Distribution regression === In distribution regression, the goal is to regress from probability distributions to reals (or vectors). Many important [[machine learning]] and statistical tasks fit into this framework, including [[Multiple-instance learning\|multi-instance learning]], and [[point estimation]] problems without analytical solution (such as [[Hyperparameter (Bayesian statistics)\|hyperparameter]] or [[entropy estimation]]). In practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between ''sets of points''. Distribution regression has been successfully applied for example in supervised entropy learning, and aerosol prediction using multispectral satellite images.<ref name = "MERR">Z. Szabó, B. Sriperumbudur, B. Póczos, A. Gretton. [http://jmlr.org/papers/v17/14-510.html Learning Theory for Distribution Regression]. ''Journal of Machine Learning Research'', 17(152):1–40, 2016.</ref> Given <math>{\left(\{X_{i,n}\}_{n=1}^{N_i}, y_i\right)}_{i=1}^\ell</math> training data, where the <math>\hat{X_i} := \{X_{i,n}\}_{n=1}^{N_i}</math> bag contains samples from a probability distribution <math>X_i</math> and the <math>i^\text{th}</math> output label is <math>y_i\in \R</math>, one can tackle the distribution regression task by taking the embeddings of the distributions, and learning the regressor from the embeddings to the outputs. In other words, one can consider the following kernel [[Tikhonov regularization\|ridge regression]] problem <math>(\lambda>0)</math> Line 319 ⟶ 314: :<math>J(f) = \frac{1}{\ell} \sum_{i=1}^\ell \left[f\left(\mu_{\hat{X_i}}\right)-y_i\right]^2 + \lambda \\|f\\|_{\mathcal{H}(K)}^2 \to \min_{f\in \mathcal{H}(K)}, </math> where :<math>\mu_{\hat{X}_i} = \int_\Omega k(\cdot,u) \, \mathrm{d} \hat{X}_i(u)= \frac{1}{N_i} \sum_{n=1}^{N_i} k(\cdot, X_{i,n})</math> Line 332 ⟶ 327: In this simple example, which is taken from Song et al.,<ref name = "Song2013"/> <math>X, Y</math> are assumed to be [[Probability distribution#Discrete probability distribution\|discrete random variables]] which take values in the set <math>\{1,\ldots,K\} </math> and the kernel is chosen to be the [[Kronecker delta]] function, so <math>k(x,x') = \delta(x,x')</math>. The feature map corresponding to this kernel is the [[standard basis]] vector <math>\varphi(x) = \mathbf{e}_x</math>. The kernel embeddings of such a distributions are thus vectors of marginal probabilities while the embeddings of joint distributions in this setting are <math>K\times K </math> matrices specifying joint probability tables, and the explicit form of these embeddings is :<math>\mu_X = \mathbb{E}_X [\mathbf{e}_X] = \begin{pmatrix} P(X=1) \\ \vdots \\ P(X=K) \\ \end{pmatrix}</math> :<math>\mathcal{C}_{XY} = \mathbb{E~~}_{XY~~} [\mathbf{e}_X \otimes \mathbf{e}_Y] = ( P(X=s, Y=t))_{s,t \in \{1,\ldots,K\}} </math> ~~The~~When <math> P(X=s)>0 </math>, for all <math> s \in \{1,\ldots,K\} </math>, the conditional distribution embedding operator, :<math>\mathcal{C}_{Y\mid X} = \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1},</math> is in this setting a conditional probability table Line 343 ⟶ 338: :<math>\mathcal{C}_{Y \mid X} = ( P(Y=s \mid X=t))_{s,t \in \{1,\dots,K\}}</math> and :<math>\mathcal{C}_{XX} =\begin{pmatrix} P(X=1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & P(X=K) \\ \end{pmatrix}</math> Line 353 ⟶ 348: In this discrete-valued setting with the Kronecker delta kernel, the [[#Rules of probability as operations in the RKHS\|kernel sum rule]] becomes :<math>\underbrace{\begin{pmatrix} QP(X=1) \\ \vdots \\ P(X = N) \\ \end{pmatrix}}_{\mu_X^\pi} = \underbrace{\begin{pmatrix} \\ P(X=s \mid Y=t) \\ \\ \end{pmatrix}}_{\mathcal{C}_{X\mid Y}} \underbrace{\begin{pmatrix} \pi(Y=1) \\ \vdots \\ \pi(Y = N) \\ \end{pmatrix}}_{ \mu_Y^\pi}</math> The [[#Rules of probability as operations in the RKHS\|kernel chain rule]] in this case is given by :<math>\underbrace{\begin{pmatrix} \\ QP(X=s,Y=t) \\ \\ \end{pmatrix} }_{\mathcal{C}_{XY}^\pi} = \underbrace{\begin{pmatrix} \\ P(X=s \mid Y=t) \\ \\ \end{pmatrix} }_{\mathcal{C}_{X \mid Y}} \underbrace{ \begin{pmatrix} \pi(Y=1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \pi(Y=K) \\ \end{pmatrix} }_{\mathcal{C}_{YY}^\pi} </math>