== Rules of probability as operations in the RKHS ==
This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al.<ref name = "Song2013" /><ref name = "SongCDE" />
The following notation is adopted:
* <math>P(X,Y) = </math> joint distribution over random variables <math>X, Y</math>
* <math>P(X) = \int_\Omega P(X, \mathrm{d}y) = </math> marginal distribution of <math>X</math>; <math>P(Y) = </math> marginal distribution of <math>Y</math>
* <math>P(Y \mid X) = \frac{P(X,Y)}{P(X)} = </math> conditional distribution of <math>Y</math> given <math>X</math>, with corresponding conditional embedding operator <math>\mathcal{C}_{Y \mid X}</math>
* <math> \pi(Y) = </math> prior distribution over <math> Y </math>
* <math> Q </math> is used to distinguish distributions which incorporate the prior from distributions <math> P </math> which do not rely on the prior
In practice, all embeddings are empirically estimated from data <math>\{(x_1, y_1), \dots, (x_n, y_n)\}</math>, and it is assumed that a set of samples <math>\{\widetilde{y}_1, \dots, \widetilde{y}_{\widetilde{n}}\}</math> may be used to estimate the kernel embedding of the prior distribution <math>\pi(Y)</math>.

=== Kernel sum rule ===
In probability theory, the marginal distribution of <math>X</math> can be computed by integrating out <math>Y</math> from the joint density (including the prior distribution on <math>Y</math>)
:<math>Q(X) = \int_\Omega P(X \mid Y) \, \mathrm{d} \pi(Y)</math>
The analog of this rule in the kernel embedding framework states that <math>\mu_X^\pi</math>, the embedding of <math>Q(X)</math>, can be computed as

:<math>\mu_X^\pi = \mathbb{E}_Y \left[ \mathcal{C}_{X \mid Y} \phi(Y) \right] = \mathcal{C}_{X \mid Y} \, \mathbb{E}_Y [\phi(Y)] = \mathcal{C}_{X \mid Y} \mu_Y^\pi</math>

where <math>\mu_Y^\pi</math> is the kernel embedding of <math>\pi(Y)</math>. In practical implementations, the kernel sum rule takes the following form

:<math>\widehat{\mu}_X^\pi = \widehat{\mathcal{C}}_{X \mid Y} \widehat{\mu}_Y^\pi = \boldsymbol{\Upsilon} (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha}</math>

where <math>\widehat{\mu}_Y^\pi = \sum_{i=1}^{\widetilde{n}} \alpha_i \phi(\widetilde{y}_i)</math> is the empirical kernel embedding of the prior distribution, <math>\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_{\widetilde{n}})^T</math>, <math>\boldsymbol{\Upsilon} = \left(\phi(x_1), \dots, \phi(x_n)\right)</math>, and <math>\mathbf{G}, \widetilde{\mathbf{G}}</math> are Gram matrices with entries <math>\mathbf{G}_{ij} = k(y_i, y_j)</math>, <math>\widetilde{\mathbf{G}}_{ij} = k(y_i, \widetilde{y}_j)</math> respectively.
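In matrix form, the empirical sum rule is thus a single regularized linear solve: the embedding <math>\widehat{\mu}_X^\pi = \sum_{i=1}^n \beta_i \phi(x_i)</math> is represented by the weight vector <math>\boldsymbol{\beta} = (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha}</math>. The following NumPy sketch illustrates this computation; the Gaussian kernel, function names, and regularization value are illustrative assumptions, not part of the cited formulation.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix with entries k(a_i, b_j) for a Gaussian RBF kernel
    # (an assumed kernel choice; any positive-definite kernel works).
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma**2))

def kernel_sum_rule(Y, Y_tilde, alpha, lam=1e-3):
    # Empirical kernel sum rule: returns weights beta such that
    # mu_X^pi is represented as sum_i beta_i * phi(x_i),
    # i.e. beta = (G + lam*I)^{-1} @ G_tilde @ alpha.
    n = Y.shape[0]
    G = gaussian_kernel(Y, Y)              # G_ij = k(y_i, y_j)
    G_tilde = gaussian_kernel(Y, Y_tilde)  # G~_ij = k(y_i, y~_j)
    return np.linalg.solve(G + lam * np.eye(n), G_tilde @ alpha)
</syntaxhighlight>

Here <code>np.linalg.solve</code> is used rather than an explicit matrix inverse, the standard numerically stable choice for regularized Gram systems.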
=== Kernel chain rule ===
In probability theory, a joint distribution can be factorized into a product between conditional and marginal distributions
:<math>Q(X,Y) = P(X | Y) \pi(Y) </math>
The analog of this rule in the kernel embedding framework states that <math>\mathcal{C}_{XY}^\pi</math>, the joint embedding of <math>Q(X,Y)</math>, can be factorized as a composition of the conditional embedding operator with the auto-covariance operator associated with <math>\pi(Y)</math>

:<math>\mathcal{C}_{XY}^\pi = \mathcal{C}_{X \mid Y} \mathcal{C}_{YY}^\pi</math>
where
:<math>\mathcal{C}_{XY}^\pi = \mathbb{E}_{XY} [\phi(X) \otimes \phi(Y) ],</math>
:<math>\mathcal{C}_{YY}^\pi = \mathbb{E}_Y [\phi(Y) \otimes \phi(Y)].</math>
In practical implementations, the kernel chain rule takes the following form
:<math>\widehat{\mathcal{C}}_{XY}^\pi = \widehat{\mathcal{C}}_{X \mid Y} \widehat{\mathcal{C}}_{YY}^\pi = \boldsymbol{\Upsilon} (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \, \text{diag}(\boldsymbol{\alpha}) \, \widetilde{\boldsymbol{\Phi}}^T</math>

where <math>\widetilde{\boldsymbol{\Phi}} = \left(\phi(\widetilde{y}_1), \dots, \phi(\widetilde{y}_{\widetilde{n}})\right)</math>.
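In matrix form, the empirical chain rule is determined by the weight matrix <math>\mathbf{B} = (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \, \text{diag}(\boldsymbol{\alpha})</math>, so that <math>\widehat{\mathcal{C}}_{XY}^\pi = \sum_{i,j} \mathbf{B}_{ij} \, \phi(x_i) \otimes \phi(\widetilde{y}_j)</math>. A minimal sketch under the same assumptions as the sum-rule example, reusing its <code>gaussian_kernel</code> helper:

<syntaxhighlight lang="python">
def kernel_chain_rule(Y, Y_tilde, alpha, lam=1e-3):
    # Empirical kernel chain rule: returns the weight matrix B such that
    # C_XY^pi is represented as sum_{ij} B_ij * phi(x_i) (x) phi(y~_j),
    # i.e. B = (G + lam*I)^{-1} @ G_tilde @ diag(alpha).
    n = Y.shape[0]
    G = gaussian_kernel(Y, Y)              # G_ij = k(y_i, y_j)
    G_tilde = gaussian_kernel(Y, Y_tilde)  # G~_ij = k(y_i, y~_j)
    # Broadcasting G_tilde * alpha scales column j by alpha_j,
    # which equals right-multiplication by diag(alpha).
    return np.linalg.solve(G + lam * np.eye(n), G_tilde * alpha)
</syntaxhighlight>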
=== Kernel Bayes' rule ===
In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as
:<math>Q(Y \mid x) = \frac{P(x \mid Y) \pi(Y)}{Q(x)}, \qquad \text{where } Q(x) = \int_\Omega P(x \mid y) \, \mathrm{d}\pi(y)</math>
The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution
:<math>\mu_{Y \mid x}^\pi = \mathcal{C}_{Y \mid X}^\pi \phi(x) = \mathcal{C}_{YX}^\pi \left( \mathcal{C}_{XX}^\pi \right)^{-1} \phi(x)</math>
where from the chain rule:
:<math> \mathcal{C}_{YX}^\pi = \left( \mathcal{C}_{X|Y} \mathcal{C}_{YY}^\pi \right)^T.</math>
In practical implementations, the kernel Bayes' rule takes the following form
:<math>\widehat{\mu}_{Y \mid x}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( (\widehat{\mathcal{C}}_{XX}^\pi)^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi \phi(x) = \widetilde{\boldsymbol{\Phi}} \boldsymbol{\Lambda}^T \left( (\mathbf{D} \mathbf{K})^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \mathbf{K} \mathbf{D} \mathbf{K}_x</math>
where
:<math>\boldsymbol{\Lambda} = \left(\mathbf{G} + \lambda \mathbf{I} \right)^{-1} \widetilde{\mathbf{G}} \, \text{diag}(\boldsymbol{\alpha}), \qquad \mathbf{D} = \text{diag}\left(\left(\mathbf{G} + \lambda \mathbf{I} \right)^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha} \right), \qquad \mathbf{K}_{ij} = k(x_i, x_j), \qquad (\mathbf{K}_x)_i = k(x_i, x).</math>
Two regularization parameters are used in this framework: <math>\lambda</math> for the estimation of <math>\widehat{\mathcal{C}}_{YX}^\pi, \widehat{\mathcal{C}}_{XX}^\pi = \boldsymbol{\Upsilon} \mathbf{D} \boldsymbol{\Upsilon}^T</math> and <math>\widetilde{\lambda}</math> for the estimation of the final conditional embedding operator

:<math>\widehat{\mathcal{C}}_{Y \mid X}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( (\widehat{\mathcal{C}}_{XX}^\pi)^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi.</math>

The latter regularization is done on the square of <math>\widehat{\mathcal{C}}_{XX}^\pi</math> because <math>\mathbf{D}</math> may not be [[Positive-definite matrix|positive definite]].
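Expressed over the samples, the posterior embedding is <math>\widehat{\mu}_{Y \mid x}^\pi = \sum_{j=1}^{\widetilde{n}} w_j \phi(\widetilde{y}_j)</math> with weight vector <math>\mathbf{w} = \boldsymbol{\Lambda}^T \left( (\mathbf{D}\mathbf{K})^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \mathbf{K} \mathbf{D} \mathbf{K}_x</math>. The sketch below assembles this from the formulas above, reusing <code>gaussian_kernel</code> from the sum-rule example; the kernel choice and parameter values are again illustrative assumptions, with <math>\lambda</math> regularizing the Gram inverse inside <math>\boldsymbol{\Lambda}</math> and <math>\mathbf{D}</math> and <math>\widetilde{\lambda}</math> the final inversion, as described above.

<syntaxhighlight lang="python">
def kernel_bayes_rule(X, Y, Y_tilde, alpha, x, lam=1e-3, lam_tilde=1e-3):
    # Empirical kernel Bayes' rule: returns weights w such that the posterior
    # embedding mu_{Y|x}^pi is represented as sum_j w_j * phi(y~_j).
    n = X.shape[0]
    G = gaussian_kernel(Y, Y)                      # G_ij = k(y_i, y_j)
    G_tilde = gaussian_kernel(Y, Y_tilde)          # G~_ij = k(y_i, y~_j)
    K = gaussian_kernel(X, X)                      # K_ij = k(x_i, x_j)
    K_x = gaussian_kernel(X, x[None, :]).ravel()   # (K_x)_i = k(x_i, x)
    R = np.linalg.solve(G + lam * np.eye(n), G_tilde)
    Lam = R * alpha          # Lambda = (G + lam*I)^{-1} G~ diag(alpha)
    D = np.diag(R @ alpha)   # D = diag((G + lam*I)^{-1} G~ alpha)
    DK = D @ K
    # w = Lambda^T ((D K)^2 + lam_tilde*I)^{-1} K D K_x
    w = Lam.T @ np.linalg.solve(DK @ DK + lam_tilde * np.eye(n), K @ (D @ K_x))
    return w
</syntaxhighlight>

Since recovering a full distribution from its embedding is a separate pre-image problem, a simple heuristic point estimate of the posterior mean is the weighted sum <code>w @ Y_tilde</code>.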
== Applications ==