Kernel embedding of distributions: Difference between revisions

:<math>\forall f \in \mathcal{H}, \forall x \in \Omega \qquad \langle f, k(x,\cdot) \rangle_\mathcal{H} = f(x).</math>
 
One may alternatively consider <math>k(x,\cdot)</math> an implicit feature mapping <math>\varphi(x)</math> from <math>\Omega</math> to <math> \mathcal{H} </math> (which is therefore also called the feature space), so that <math>k(x, x') = \langle \varphi(x), \varphi(x')\rangle_\mathcal{H}</math> can be viewed as a measure of similarity between points <math>x, x' \in \Omega.</math> While the [[similarity measure]] is linear in the feature space, it may be highly nonlinear in the original space depending on the choice of kernel.
 
===Kernel embedding===
The kernel embedding of the distribution <math>P(X)</math> in <math> \mathcal{H} </math> (also called the '''kernel mean''' or '''mean map''') is given by:<ref name = "Smola2007" />
 
:<math>\mu_X := \mathbb{E}_X [k(X, \cdot) ] = \mathbb{E}_X [\varphi(X) ] = \int_\Omega \varphi(x) \ \mathrm{d}P(x) </math>
 
If <math>P</math> admits a square-integrable density <math>p</math>, then <math>\mu_X = \mathcal{E}_k p</math>, where <math>\mathcal{E}_k</math> is the [[Hilbert–Schmidt integral operator]]. A kernel is ''characteristic'' if the mean embedding <math>\mu: \{\text{family of distributions over }\Omega \} \to \mathcal{H} </math> is injective.<ref name = "Fukumizu2008">K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf (2008). [http://papers.nips.cc/paper/3340-kernel-measures-of-conditional-dependence.pdf Kernel measures of conditional independence]. ''Advances in Neural Information Processing Systems'' '''20''', MIT Press, Cambridge, MA.</ref> Each distribution can thus be uniquely represented in the RKHS, and all statistical features of distributions are preserved by the kernel embedding if a characteristic kernel is used.
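As an illustrative sketch of this definition (the Gaussian kernel, sample size, and evaluation points below are assumptions, not from the source), the kernel mean can be estimated by averaging the feature maps of i.i.d. samples; by the reproducing property, the estimate is fully determined by its evaluations <math>t \mapsto \tfrac{1}{n}\sum_{i} k(x_i, t)</math>:

```python
import numpy as np

def gaussian_kernel(x, t, sigma=1.0):
    """Gaussian RBF kernel k(x, t) = exp(-(x - t)^2 / (2 sigma^2))."""
    return np.exp(-(np.asarray(x) - t) ** 2 / (2 * sigma ** 2))

def empirical_mean_embedding(samples, sigma=1.0):
    """Empirical kernel mean, represented via the reproducing property:
    evaluating it at t returns (1/n) sum_i k(x_i, t) ~ E_X[k(X, t)]."""
    return lambda t: gaussian_kernel(samples, t, sigma).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # samples from P = N(0, 1)
mu_hat = empirical_mean_embedding(x)
val = mu_hat(0.0)  # for N(0,1) and sigma = 1, E[k(X, 0)] = 1/sqrt(2)
```

The closed form <math>\mathbb{E}[k(X,t)] = \tfrac{1}{\sqrt{2}} e^{-t^2/4}</math> for a standard normal input follows from Gaussian integration and serves here only as a sanity check on the estimate.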
 
===Joint distribution embedding===
If <math>Y</math> denotes another random variable (for simplicity, assume the co-___domain of <math>Y</math> is also <math>\Omega</math> with the same kernel <math>k</math>, which satisfies <math> \langle \varphi(x) \otimes \varphi(y), \varphi(x') \otimes \varphi(y') \rangle = k(x,x') \, k(y,y')</math>), then the [[Joint probability distribution|joint distribution]] <math> P(X,Y) </math> can be mapped into a [[tensor product]] feature space <math>\mathcal{H} \otimes \mathcal{H} </math> via <ref name = "Song2013"/>
 
:<math> \mathcal{C}_{XY} = \mathbb{E}_{XY} [\varphi(X) \otimes \varphi(Y)] = \int_{\Omega \times \Omega} \varphi(x) \otimes \varphi(y) \ \mathrm{d} P(x,y) </math>
 
By the equivalence between a [[tensor]] and a [[linear map]], this joint embedding may be interpreted as an uncentered [[cross-covariance]] operator <math>\mathcal{C}_{XY}: \mathcal{H} \to \mathcal{H}</math> from which the cross-covariance of mean-zero functions <math>f,g \in \mathcal{H}</math> can be computed as <ref name = "SongCDE">L. Song, J. Huang, A. J. Smola, K. Fukumizu. (2009).[http://www.stanford.edu/~jhuang11/research/pubs/icml09/icml09.pdf Hilbert space embeddings of conditional distributions]. ''Proc. Int. Conf. Machine Learning''. Montreal, Canada: 961–968.</ref>
Given <math>n</math> pairs of training examples <math>\{(x_1, y_1), \dots, (x_n, y_n)\} </math> drawn i.i.d. from <math>P</math>, we can also empirically estimate the joint distribution kernel embedding via
 
:<math>\widehat{\mathcal{C}}_{XY} = \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \varphi(y_i) </math>
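As a minimal sketch of this estimator (assuming the linear kernel on <math>\mathbb{R}^2</math>, so that the feature map is the identity and the joint embedding is an ordinary matrix; the data-generating model is illustrative), the empirical joint embedding reduces to the uncentered cross-covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 2))
y = x @ np.array([[1.0, 0.5], [0.0, 1.0]]) + 0.1 * rng.normal(size=(n, 2))

# With the linear kernel k(x, x') = <x, x'>, the feature map is the
# identity, so (1/n) sum_i varphi(x_i) (x) varphi(y_i) is just the
# uncentered cross-covariance matrix of the samples.
C_xy = x.T @ y / n

# Viewed as an operator, C_XY reproduces uncentered cross-covariances of
# linear functionals f(x) = <f, x> and g(y) = <g, y>:
f = np.array([1.0, 0.0])
g = np.array([0.0, 1.0])
lhs = f @ C_xy @ g                # <f, C_XY g>
rhs = np.mean((x @ f) * (y @ g))  # empirical E[f(X) g(Y)]
```

The two quantities agree by construction, which illustrates the tensor/linear-map equivalence mentioned below.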
 
===Conditional distribution embedding===
Given a [[conditional distribution]] <math>P(Y \mid X),</math> one can define the corresponding RKHS embedding as <ref name = "Song2013"/>
 
:<math>\mu_{Y \mid x} = \mathbb{E}_{Y \mid x} [ \varphi(Y) ] = \int_\Omega \varphi(y) \ \mathrm{d}P(y \mid x) </math>
 
Note that the embedding of <math>P(Y \mid X) </math> thus defines a family of points in the RKHS indexed by the values <math>x</math> taken by the conditioning variable <math>X</math>. By fixing <math>X</math> to a particular value, we obtain a single element of <math>\mathcal{H}</math>, and thus it is natural to define the operator
 
:<math>\begin{cases} \mathcal{C}_{Y\mid X}: \mathcal{H} \to \mathcal{H} \\ \mathcal{C}_{Y\mid X} = \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} \end{cases}</math>
 
which given the feature mapping of <math>x</math> outputs the conditional embedding of <math>Y</math> given <math>X = x.</math> Assuming that for all <math>g \in \mathcal{H}: \mathbb{E}_{Y \mid X} [g(Y)] \in \mathcal{H},</math> it can be shown that <ref name = "SongCDE" />
 
:<math> \mu_{Y \mid x} = \mathcal{C}_{Y \mid X} \varphi(x)</math>
 
This assumption is always true for finite domains with characteristic kernels, but may not necessarily hold for continuous domains.<ref name = "Song2013"/> Nevertheless, even in cases where the assumption fails, <math> \mathcal{C}_{Y \mid X} \varphi(x) </math> may still be used to approximate the conditional kernel embedding <math>\mu_{Y \mid x},</math> and in practice, the inversion operator is replaced with a regularized version of itself <math>(\mathcal{C}_{XX} + \lambda \mathbf{I})^{-1} </math> (where <math>\mathbf{I}</math> denotes the [[identity matrix]]).
 
Given training examples <math>\{(x_1, y_1),\dots, (x_n, y_n)\},</math> the empirical kernel conditional embedding operator may be estimated as <ref name = "Song2013" />
 
:<math>\widehat{\mathcal{C}}_{Y\mid X} = \boldsymbol{\Phi} (\mathbf{K} + \lambda \mathbf{I})^{-1} \boldsymbol{\Upsilon}^T</math>
 
where <math>\boldsymbol{\Phi} = \left(\varphi(y_1),\dots, \varphi(y_n)\right), \boldsymbol{\Upsilon} = \left(\varphi(x_1),\dots, \varphi(x_n)\right) </math> are implicitly formed feature matrices, <math>\mathbf{K} =\boldsymbol{\Upsilon}^T \boldsymbol{\Upsilon} </math> is the Gram matrix for samples of <math>X</math>, and <math>\lambda</math> is a [[Regularization (mathematics)|regularization]] parameter needed to avoid [[overfitting]].
 
Thus, the empirical estimate of the kernel conditional embedding is given by a weighted sum of samples of <math>Y</math> in the feature space:
 
:<math> \widehat{\mu}_{Y\mid x} = \sum_{i=1}^n \beta_i (x) \varphi(y_i) = \boldsymbol{\Phi} \boldsymbol{\beta}(x) </math>
 
where <math> \boldsymbol{\beta}(x) = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{K}_x</math> and <math> \mathbf{K}_x = \left( k(x_1, x), \dots, k(x_n, x) \right)^T </math>
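A minimal numerical sketch of these weights (the Gaussian kernel, bandwidth, regularization value, and toy data are assumptions): since <math>\widehat{\mu}_{Y\mid x}</math> is a weighted sum of feature-mapped samples, the same weight vector gives plug-in estimates <math>\sum_i \beta_i(x) g(y_i) \approx \mathbb{E}[g(Y)\mid X = x]</math> of conditional expectations:

```python
import numpy as np

def rbf_gram(a, b, sigma=0.5):
    """Gram matrix K_ij = k(a_i, b_j) for the Gaussian kernel, 1-d inputs."""
    d = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def conditional_weights(x_train, x, lam=1e-3, sigma=0.5):
    """beta(x) = (K + lam I)^{-1} K_x, the weights of the empirical
    conditional embedding mu_hat_{Y|x} = sum_i beta_i(x) varphi(y_i)."""
    K = rbf_gram(x_train, x_train, sigma)
    K_x = rbf_gram(x_train, [x], sigma)[:, 0]
    return np.linalg.solve(K + lam * np.eye(len(x_train)), K_x)

# Toy model Y = sin(X) + noise: with g(y) = y, the weighted sum is a
# kernel ridge regression estimate of the conditional mean E[Y | X = x].
rng = np.random.default_rng(2)
x_tr = rng.uniform(-3.0, 3.0, size=500)
y_tr = np.sin(x_tr) + 0.05 * rng.normal(size=500)
beta = conditional_weights(x_tr, 1.0)
cond_mean = beta @ y_tr   # should be close to sin(1)
```

The connection to regularized least-squares regression noted below is visible here: the same linear system <math>(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{K}_x</math> underlies both views.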
 
* The empirical kernel conditional distribution embedding operator <math>\widehat{\mathcal{C}}_{Y|X}</math> can alternatively be viewed as the solution of the following regularized least squares (function-valued) regression problem <ref>S. Grunewalder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, M. Pontil. (2012). [http://icml.cc/2012/papers/898.pdf Conditional mean embeddings as regressors]. ''Proc. Int. Conf. Machine Learning'': 1823–1830.</ref>
::<math>\min_{\mathcal{C}: \mathcal{H} \to \mathcal{H}} \sum_{i=1}^n \left \|\varphi(y_i)-\mathcal{C} \varphi(x_i) \right \|_\mathcal{H}^2 + \lambda \|\mathcal{C} \|_{HS}^2</math>
:where <math>\|\cdot\|_{HS}</math> is the [[Hilbert–Schmidt operator|Hilbert–Schmidt norm]].
 
* <math>P(X)= \int_\Omega P(X, \mathrm{d}y) = </math> marginal distribution of <math>X</math>; <math>P(Y)= </math> marginal distribution of <math>Y </math>
 
* <math> P(Y \mid X) = \frac{P(X,Y)}{P(X)} = </math> conditional distribution of <math> Y </math> given <math> X </math> with corresponding conditional embedding operator <math> \mathcal{C}_{Y \mid X}</math>
 
* <math> \pi(Y) = </math> prior distribution over <math> Y </math>
In probability theory, the marginal distribution of <math>X</math> can be computed by integrating out <math> Y </math> from the joint density (including the prior distribution on <math>Y</math>)
 
:<math> Q(X) = \int_\Omega P(X \mid Y) \mathrm{d} \pi(Y) </math>
 
The analog of this rule in the kernel embedding framework states that <math>\mu_X^\pi,</math> the RKHS embedding of <math>Q(X)</math>, can be computed via
 
:<math>\mu_X^\pi = \mathbb{E}_{Y} [\mathcal{C}_{X \mid Y} \varphi(Y) ] = \mathcal{C}_{X\mid Y} \mathbb{E}_{Y} [\varphi(Y)] = \mathcal{C}_{X\mid Y} \mu_Y^\pi </math>
 
where <math>\mu_Y^\pi</math> is the kernel embedding of <math>\pi(Y).</math> In practical implementations, the kernel sum rule takes the following form
 
:<math> \widehat{\mu}_X^\pi = \widehat{\mathcal{C}}_{X \mid Y} \widehat{\mu}_Y^\pi = \boldsymbol{\Upsilon} (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \boldsymbol{\alpha} </math>
 
where
 
:<math>\mu_Y^\pi = \sum_{i=1}^{\widetilde{n}} \alpha_i \varphi(\widetilde{y}_i)</math>
 
is the empirical kernel embedding of the prior distribution, <math>\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{\widetilde{n}} )^T,</math> <math>\boldsymbol{\Upsilon} = \left(\phivarphi(x_1), \ldots, \phivarphi(x_n) \right) </math>, and <math>\mathbf{G}, \widetilde{\mathbf{G}} </math> are Gram matrices with entries <math>\mathbf{G}_{ij} = k(y_i, y_j), \widetilde{\mathbf{G}}_{ij} = k(y_i, \widetilde{y}_j) </math> respectively.
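A brief sketch of this computation (the Gaussian kernel, bandwidth, regularization value, and degenerate test case are assumptions): in the degenerate setting where the training pairs satisfy <math>X = Y</math>, the sum rule should approximately return the prior embedding itself, which gives a simple sanity check:

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    """Gram matrix K_ij = k(a_i, b_j) for the Gaussian kernel, 1-d inputs."""
    d = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def kernel_sum_rule(y_s, y_prior, alpha, lam=1e-4, sigma=1.0):
    """Weights w with mu_hat_X^pi = sum_i w_i varphi(x_i), following
    mu_hat_X^pi = Upsilon (G + lam I)^{-1} G_tilde alpha."""
    G = rbf_gram(y_s, y_s, sigma)            # G_ij = k(y_i, y_j)
    G_tilde = rbf_gram(y_s, y_prior, sigma)  # G~_ij = k(y_i, y~_j)
    return np.linalg.solve(G + lam * np.eye(len(y_s)), G_tilde @ alpha)

# Degenerate conditional X = Y: the embedding of Q(X) should then
# coincide with the prior embedding mu_Y^pi.
y_s = np.linspace(-3.0, 3.0, 200)
x_s = y_s.copy()
y_prior = np.array([0.5, -1.0])
alpha = np.array([0.5, 0.5])
w = kernel_sum_rule(y_s, y_prior, alpha)

t = 0.0
embed_val = w @ rbf_gram(x_s, [t])[:, 0]          # mu_hat_X^pi evaluated at t
prior_val = alpha @ rbf_gram(y_prior, [t])[:, 0]  # mu_Y^pi evaluated at t
```

Evaluating both embeddings at a test point <math>t</math> uses the reproducing property; the two values nearly agree because the learned conditional operator is close to the identity in this degenerate case.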
 
=== Kernel chain rule ===
In probability theory, a joint distribution can be factorized into a product between conditional and marginal distributions
 
:<math>Q(X,Y) = P(X \mid Y) \pi(Y) </math>
 
The analog of this rule in the kernel embedding framework states that <math> \mathcal{C}_{XY}^\pi,</math> the joint embedding of <math>Q(X,Y),</math> can be factorized as the composition of the conditional embedding operator with the auto-covariance operator associated with <math>\pi(Y)</math>
where
 
:<math>\mathcal{C}_{XY}^\pi = \mathbb{E}_{XY} [\varphi(X) \otimes \varphi(Y) ],</math>
:<math>\mathcal{C}_{YY}^\pi = \mathbb{E}_Y [\varphi(Y) \otimes \varphi(Y)].</math>
 
In practical implementations, the kernel chain rule takes the following form
 
:<math> \widehat{\mathcal{C}}_{XY}^\pi = \widehat{\mathcal{C}}_{X \mid Y} \widehat{\mathcal{C}}_{YY}^\pi = \boldsymbol{\Upsilon} (\mathbf{G} + \lambda \mathbf{I})^{-1} \widetilde{\mathbf{G}} \text{diag}(\boldsymbol{\alpha}) \boldsymbol{\widetilde{\Phi}}^T </math>
 
=== Kernel Bayes' rule ===
In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as
 
:<math>Q(Y\mid x) = \frac{P(x\mid Y) \pi(Y)}{Q(x)} </math> where <math> Q(x) = \int_\Omega P(x \mid y) \mathrm{d} \pi(y) </math>
 
The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution
 
:<math> \mu_{Y\mid x}^\pi = \mathcal{C}_{Y \mid X}^\pi \varphi(x) = \mathcal{C}_{YX}^\pi \left ( \mathcal{C}_{XX}^\pi \right )^{-1} \varphi(x)</math>
 
where from the chain rule:
 
:<math> \mathcal{C}_{YX}^\pi = \left( \mathcal{C}_{X\mid Y} \mathcal{C}_{YY}^\pi \right)^T.</math>
 
In practical implementations, the kernel Bayes' rule takes the following form
 
:<math>\widehat{\mu}_{Y\mid x}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( \left (\widehat{\mathcal{C}}_{XX} \right )^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi \varphi(x) = \widetilde{\boldsymbol{\Phi}} \boldsymbol{\Lambda}^T \left( (\mathbf{D} \mathbf{K})^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \mathbf{K} \mathbf{D} \mathbf{K}_x </math>
 
where
Two regularization parameters are used in this framework: <math>\lambda </math> for the estimation of <math> \widehat{\mathcal{C}}_{YX}^\pi, \widehat{\mathcal{C}}_{XX}^\pi = \boldsymbol{\Upsilon} \mathbf{D} \boldsymbol{\Upsilon}^T</math> and <math>\widetilde{\lambda}</math> for the estimation of the final conditional embedding operator
 
:<math>\widehat{\mathcal{C}}_{Y\mid X}^\pi = \widehat{\mathcal{C}}_{YX}^\pi \left( \left (\widehat{\mathcal{C}}_{XX}^\pi \right )^2 + \widetilde{\lambda} \mathbf{I} \right)^{-1} \widehat{\mathcal{C}}_{XX}^\pi.</math>
 
The latter regularization is done on the square of <math>\widehat{\mathcal{C}}_{XX}^\pi</math> because <math>\mathbf{D}</math> may not be [[Positive-definite matrix|positive definite]].
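In the discrete case with the Kronecker delta kernel (the setting of the example at the end of the article), all operators are finite matrices, the inverses exist exactly, and no regularization is needed, so the kernel Bayes' rule can be verified directly. The conditional table and prior below are illustrative:

```python
import numpy as np

# Discrete sanity check: with the Kronecker delta kernel, feature maps
# are standard basis vectors and all operators are ordinary matrices.
L = np.array([[0.9, 0.2],
              [0.1, 0.8]])    # L[x, y] = P(X = x | Y = y), i.e. C_{X|Y}
p = np.array([0.5, 0.5])      # prior pi(Y)

C_YY_pi = np.diag(p)          # auto-covariance operator of the prior
C_XY_pi = L @ C_YY_pi         # chain rule: joint embedding of Q(X, Y)
C_YX_pi = C_XY_pi.T
C_XX_pi = np.diag(L @ p)      # embedding of the marginal Q(X)

e_x = np.array([1.0, 0.0])    # condition on the observation X = 0
posterior = C_YX_pi @ np.linalg.inv(C_XX_pi) @ e_x
# matches Bayes' rule P(x|y) pi(y) / Q(x) = (9/11, 2/11)
```

The unregularized product <math>\mathcal{C}_{YX}^\pi \left(\mathcal{C}_{XX}^\pi\right)^{-1} \varphi(x)</math> reproduces the posterior exactly here because the diagonal <math>\mathcal{C}_{XX}^\pi</math> is invertible; the regularized form above is needed only when these operators are estimated from samples.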
Given ''n'' training examples from <math>P(X)</math> and ''m'' samples from <math>Q(Y)</math>, one can formulate a test statistic based on the empirical estimate of the MMD
 
:<math>\widehat{\text{MMD}}(P,Q) = \left\| \frac{1}{n}\sum_{i=1}^n \varphi(x_i) - \frac{1}{m}\sum_{i=1}^m \varphi(y_i) \right \|_{\mathcal{H}}^2 = \frac{1}{n^2} \sum_{i=1}^n\sum_{j=1}^n k(x_i, x_j) + \frac{1}{m^2} \sum_{i=1}^m\sum_{j=1}^m k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^n\sum_{j=1}^m k(x_i, y_j) </math>
 
to obtain a '''two-sample test''' <ref>A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola. (2012). [http://jmlr.org/papers/volume13/gretton12a/gretton12a.pdf A kernel two-sample test]. ''Journal of Machine Learning Research'', '''13''': 723–773.</ref> of the null hypothesis that both samples stem from the same distribution (i.e. <math>P = Q</math>) against the broad alternative <math>P \neq Q</math>.
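A compact sketch of this estimator (the Gaussian kernel, bandwidth, and sample distributions are illustrative assumptions): the biased V-statistic version computes exactly the three Gram-matrix sums of the empirical MMD:

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    """Gram matrix K_ij = k(a_i, b_j) for the Gaussian kernel, 1-d inputs."""
    d = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) empirical estimate of the squared MMD."""
    n, m = len(x), len(y)
    return (rbf_gram(x, x, sigma).sum() / n ** 2
            + rbf_gram(y, y, sigma).sum() / m ** 2
            - 2.0 * rbf_gram(x, y, sigma).sum() / (n * m))

rng = np.random.default_rng(3)
same = mmd2(rng.normal(size=500), rng.normal(size=500))          # P = Q
diff = mmd2(rng.normal(size=500), rng.normal(loc=2.0, size=500)) # P != Q
# 'same' is near 0; 'diff' is large, so the null P = Q would be rejected
```

Because this version is a squared RKHS norm, it is always nonnegative; the published test additionally calibrates a rejection threshold from the null distribution of the statistic, which is omitted here.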
 
=== Kernel belief propagation ===
[[Belief propagation]] is a fundamental algorithm for inference in [[graphical model]]s in which nodes repeatedly pass and receive messages corresponding to the evaluation of conditional expectations. In the kernel embedding framework, the messages may be represented as RKHS functions, and the conditional distribution embeddings can be applied to efficiently compute message updates. Given ''n'' samples of random variables represented by nodes in a [[Markov random field]], the incoming message to node ''t'' from node ''u'' can be expressed as
 
:<math>m_{ut}(\cdot) = \sum_{i=1}^n \beta_{ut}^i \varphi(x_t^i)</math>
 
if it is assumed to lie in the RKHS. The '''kernel belief propagation update''' message from node ''t'' to node ''s'' is then given by <ref name = "Song2013"/>
 
:<math> \widehat{m}_{ts} = \left( \odot_{u \in N(t) \backslash s} \mathbf{K}_t \boldsymbol{\beta}_{ut} \right)^T (\mathbf{K}_s + \lambda \mathbf{I} )^{-1} \boldsymbol{\Upsilon}_s^T \varphi(x_s)</math>
 
where <math>\odot</math> denotes the element-wise vector product, <math>N(t) \backslash s </math> is the set of nodes connected to ''t'' excluding node ''s'', <math> \boldsymbol{\beta}_{ut} = \left(\beta_{ut}^1, \dots, \beta_{ut}^n \right) </math>, <math>\mathbf{K}_t, \mathbf{K}_s </math> are the Gram matrices of the samples from variables <math>X_t, X_s </math>, respectively, and <math>\boldsymbol{\Upsilon}_s = \left(\varphi(x_s^1),\dots, \varphi(x_s^n)\right)</math> is the feature matrix for the samples from <math>X_s</math>.
 
Thus, if the incoming messages to node ''t'' are linear combinations of feature mapped samples from <math> X_t </math>, then the outgoing message from this node is also a linear combination of feature mapped samples from <math> X_s </math>. This RKHS function representation of message-passing updates therefore produces an efficient belief propagation algorithm in which the [[Markov Random Field#Clique factorization|potentials]] are nonparametric functions inferred from the data so that arbitrary statistical relationships may be modeled.<ref name = "Song2013"/>
One common use of HMMs is [[Hidden Markov Model#Filtering|filtering]], in which the goal is to estimate the posterior distribution over the hidden state <math>s^t</math> at time step ''t'' given a history of previous observations <math>h^t = (o^1, \dots, o^t)</math> from the system. In filtering, a '''belief state''' <math>P(S^{t+1} \mid h^{t+1})</math> is recursively maintained via a prediction step (where updates <math>P(S^{t+1} \mid h^t) = \mathbb{E}_{S^t \mid h^t} [P(S^{t+1} \mid S^t)]</math> are computed by marginalizing out the previous hidden state) followed by a conditioning step (where updates <math> P(S^{t+1} \mid h^t, o^{t+1}) \propto P(o^{t+1} \mid S^{t+1}) P(S^{t+1} \mid h^t) </math> are computed by applying Bayes' rule to condition on a new observation).<ref name = "Song2013"/> The RKHS embedding of the belief state at time ''t+1'' can be recursively expressed as
 
:<math>\mu_{S^{t+1} \mid h^{t+1}} = \mathcal{C}_{S^{t+1} O^{t+1}}^\pi \left(\mathcal{C}_{O^{t+1} O^{t+1}}^\pi \right)^{-1} \varphi(o^{t+1}) </math>
 
by computing the embeddings of the prediction step via the [[#Kernel Sum Rule|kernel sum rule]] and the embedding of the conditioning step via [[#Kernel Bayes' Rule|kernel Bayes' rule]]. Assuming a training sample <math>(\widetilde{s}^1, \dots, \widetilde{s}^T, \widetilde{o}^1, \dots, \widetilde{o}^T) </math> is given, one can in practice estimate
 
:<math>\widehat{\mu}_{S^{t+1} \mid h^{t+1}} = \sum_{i=1}^T \alpha_i^t \varphi(\widetilde{s}^i)</math>
 
and filtering with kernel embeddings is thus implemented recursively using the following updates for the weights <math>\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_T)</math> <ref name = "Song2013"/>
 
=== Domain adaptation under covariate, target, and conditional shift ===
The goal of [[Domain Adaptation|___domain adaptation]] is the formulation of learning algorithms which generalize well when the training and test data have different distributions. Given training examples <math>\{(x_i^{tr}, y_i^{tr})\}_{i=1}^n</math> and a test set <math>\{(x_j^{te}, y_j^{te}) \}_{j=1}^m </math> where the <math>y_j^{te}</math> are unknown, three types of differences are commonly assumed between the distribution of the training examples <math>P^{tr}(X,Y)</math> and the test distribution <math> P^{te}(X,Y)</math>:<ref name = "DA">K. Zhang, B. Schölkopf, K. Muandet, Z. Wang. (2013). [http://jmlr.org/proceedings/papers/v28/zhang13d.pdf Domain adaptation under target and conditional shift]. ''Journal of Machine Learning Research, '''28'''(3): 819–827.</ref><ref name = "CovS">A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf. (2008). Covariate shift and local learning by distribution matching. ''In J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N. Lawrence (eds.). Dataset shift in machine learning'', MIT Press, Cambridge, MA: 131–160.</ref>
# '''Covariate Shift''' in which the marginal distribution of the covariates changes across domains: <math> P^{tr}(X) \neq P^{te}(X)</math>
# '''Target Shift''' in which the marginal distribution of the outputs changes across domains: <math> P^{tr}(Y) \neq P^{te}(Y)</math>
By utilizing the kernel embedding of marginal and conditional distributions, practical approaches to deal with the presence of these types of differences between training and test domains can be formulated. Covariate shift may be accounted for by reweighting examples via estimates of the ratio <math>P^{te}(X)/P^{tr}(X)</math> obtained directly from the kernel embeddings of the marginal distributions of <math>X</math> in each ___domain without any need for explicit estimation of the distributions.<ref name = "CovS"/> Target shift, which cannot be similarly dealt with since no samples from <math>Y</math> are available in the test ___domain, is accounted for by weighting training examples using the vector <math>\boldsymbol{\beta}^*(\mathbf{y}^{tr}) </math> which solves the following optimization problem (where in practice, empirical approximations must be used) <ref name = "DA"/>
 
:<math>\min_{\boldsymbol{\beta}(y)} \left \|\mathcal{C}_{{(X \mid Y)}^{tr}} \mathbb{E}_{Y^{tr}} [\boldsymbol{\beta}(y) \varphi(y)] - \mu_{X^{te}} \right \|_\mathcal{H}^2</math> subject to <math>\boldsymbol{\beta}(y) \ge 0, \mathbb{E}_{Y^{tr}} [\boldsymbol{\beta}(y)] = 1</math>
 
To deal with ___location-scale conditional shift, one can perform a ___location-scale transformation of the training points to obtain new transformed training data <math> \mathbf{X}^{new} = \mathbf{X}^{tr} \odot \mathbf{W} + \mathbf{B}</math> (where <math>\odot</math> denotes the element-wise vector product). To ensure similar distributions between the new transformed training samples and the test data, <math>\mathbf{W},\mathbf{B}</math> are estimated by minimizing the following empirical kernel embedding distance <ref name = "DA"/>
=== Domain generalization via invariant feature representation ===
Given ''N'' sets of training examples sampled i.i.d. from distributions <math>P^{(1)}(X,Y), P^{(2)}(X,Y), \ldots, P^{(N)}(X,Y)</math>, the goal of '''___domain generalization''' is to formulate learning algorithms which perform well on test examples sampled from a previously unseen ___domain <math>P^*(X,Y)</math> where no data from the test ___domain is available at training time. If conditional distributions <math>P(Y \mid X)</math> are assumed to be relatively similar across all domains, then a learner capable of ___domain generalization must estimate a functional relationship between the variables which is robust to changes in the marginals <math>P(X)</math>. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions while preserving a common conditional distribution shared between all training domains.<ref name = "DICA">K. Muandet, D. Balduzzi, B. Schölkopf. (2013).[http://jmlr.org/proceedings/papers/v28/muandet13.pdf Domain Generalization Via Invariant Feature Representation]. ''30th International Conference on Machine Learning''.</ref> DICA thus extracts ''invariants'', features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as [[kernel principal component analysis]], transfer component analysis, and covariance operator inverse regression.<ref name = "DICA"/>
 
Defining a probability distribution <math>\mathcal{P}</math> on the RKHS <math>\mathcal{H}</math> with
 
 
== Example ==
In this simple example, which is taken from Song et al.,<ref name = "Song2013"/> <math>X, Y</math> are assumed to be [[Probability distribution#Discrete probability distribution|discrete random variables]] which take values in the set <math>\{1,\ldots,K\} </math>, and the kernel is chosen to be the [[Kronecker delta]] function, so <math>k(x,x') = \delta(x,x')</math>. The feature map corresponding to this kernel is the [[standard basis]] vector <math>\varphi(x) = \mathbf{e}_x</math>. The kernel embeddings of such distributions are thus vectors of marginal probabilities, while the embeddings of joint distributions in this setting are <math>K\times K </math> matrices specifying joint probability tables, and the explicit form of these embeddings is
 
:<math>\mu_X = \mathbb{E}_X [\mathbf{e}_X] = \begin{pmatrix} P(X=1) \\ \vdots \\ P(X=K) \\ \end{pmatrix}</math>
Thus, the embeddings of the conditional distribution under a fixed value of <math>X</math> may be computed as
 
:<math>\mu_{Y \mid x} = \mathcal{C}_{Y \mid X} \varphi(x) = \begin{pmatrix} P(Y=1 \mid X = x) \\ \vdots \\ P(Y=K \mid X = x) \\ \end{pmatrix} </math>
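These tabular computations can be reproduced directly with small matrices (the joint probability table is illustrative, and values are 0-indexed for convenience):

```python
import numpy as np

# Joint probability table P[x, y] = P(X = x, Y = y) for K = 2 values;
# with the Kronecker delta kernel, varphi(x) is the basis vector e_x.
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])

C_XY = P                          # joint embedding as a K x K matrix
C_XX = np.diag(P.sum(axis=1))     # E[e_X (x) e_X] = diag of marginal P(X)

# C_{Y|X} = C_YX C_XX^{-1}; applied to e_x it yields the column of
# conditional probabilities P(Y | X = x).
C_Y_given_X = C_XY.T @ np.linalg.inv(C_XX)
mu_Y_given_x0 = C_Y_given_X @ np.array([1.0, 0.0])
# mu_Y_given_x0 = (P(Y=0|X=0), P(Y=1|X=0)) = (1/3, 2/3)
```

Each column of the resulting operator is a conditional probability vector, so the columns sum to one, mirroring the matrix displayed above.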
 
In this discrete-valued setting with the Kronecker delta kernel, the [[#Rules of probability as operations in the RKHS|kernel sum rule]] becomes