=== Domain generalization via invariant feature representation ===
Given ''N'' sets of training examples sampled i.i.d. from distributions <math>P^{(1)}(X,Y), P^{(2)}(X,Y), \ldots, P^{(N)}(X,Y)</math>, the goal of '''___domain generalization''' is to formulate learning algorithms which perform well on test examples sampled from a previously unseen ___domain <math>P^*(X,Y)</math> where no data from the test ___domain is available at training time. If conditional distributions <math>P(Y \mid X)</math> are assumed to be relatively similar across all domains, then a learner capable of ___domain generalization must estimate a functional relationship between the variables which is robust to changes in the marginals <math>P(X)</math>. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions while preserving a common conditional distribution shared between all training domains.<ref name = "DICA">K. Muandet, D. Balduzzi, B. Schölkopf. (2013). [http://jmlr.org/proceedings/papers/v28/muandet13.pdf Domain Generalization Via Invariant Feature Representation]. ''30th International Conference on Machine Learning''.</ref> DICA thus extracts ''invariants'', features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as [[kernel principal component analysis]], transfer component analysis, and covariance operator inverse regression.<ref name = "DICA"/>
Defining a probability distribution <math>\mathcal{P}</math> on the RKHS <math>\mathcal{H}</math> with

:<math>\mathcal{P} \left (\mu_{X^{(i)}Y^{(i)}} \right ) = \frac{1}{N} \qquad \text{for } i = 1, \dots, N,</math>

DICA measures dissimilarity between domains via '''distributional variance''' which is computed as
 
:<math>V_\mathcal{H} (\mathcal{P}) = \frac{1}{N} \text{tr}(\mathbf{G}) - \frac{1}{N^2} \sum_{i,j=1}^N \mathbf{G}_{ij} </math>
 
where
 
:<math>\mathbf{G}_{ij} = \left \langle \mu_{X^{(i)}}, \mu_{X^{(j)}} \right \rangle_\mathcal{H} </math>
 
so <math>\mathbf{G}</math> is an <math>N \times N</math> Gram matrix over the distributions from which the training data are sampled. By finding an [[Orthogonal matrix|orthogonal transform]] onto a low-dimensional [[Linear subspace|subspace]] ''B'' (in the feature space) which minimizes the distributional variance, DICA simultaneously ensures that ''B'' aligns with the [[Basis function|bases]] of a '''central subspace''' ''C'' for which <math>Y</math> becomes independent of <math>X</math> given <math>C^T X</math> across all domains. In the absence of target values <math>Y</math>, an unsupervised version of DICA may be formulated which finds a low-dimensional subspace that minimizes distributional variance while simultaneously maximizing the variance of <math>X</math> (in the feature space) across all domains (rather than preserving a central subspace).<ref name = "DICA"/>
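In practice each mean embedding is replaced by its empirical counterpart, so the entries of <math>\mathbf{G}</math> can be estimated from samples as <math>\hat{\mathbf{G}}_{ij} = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k\left(x_k^{(i)}, x_l^{(j)}\right)</math>, where <math>n_i</math> is the number of samples drawn from ___domain <math>i</math>. The sketch below illustrates this estimate of the distributional variance only; the function names, the choice of a Gaussian kernel, and the synthetic data are illustrative assumptions rather than part of the DICA reference.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def distributional_variance(domains, sigma=1.0):
    # domains: list of (n_i, d) sample arrays, one per training ___domain.
    # G[i, j] estimates <mu_{X^(i)}, mu_{X^(j)}>_H by averaging the kernel
    # over all pairs of samples drawn from domains i and j.
    N = len(domains)
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            G[i, j] = gaussian_kernel(domains[i], domains[j], sigma).mean()
    # V_H(P) = tr(G)/N - sum_{i,j} G_{ij} / N^2
    return np.trace(G) / N - G.sum() / N ** 2

# Example: three domains whose marginals differ only by a mean shift.
rng = np.random.default_rng(0)
domains = [rng.normal(loc=shift, scale=1.0, size=(100, 2))
           for shift in (0.0, 0.5, 1.0)]
print(distributional_variance(domains, sigma=1.0))
</syntaxhighlight>

Since the distributional variance equals the average squared distance of the mean embeddings from their centroid, it is non-negative and vanishes exactly when all (empirical) mean embeddings coincide, which is the condition DICA pushes toward after projecting the data onto ''B''.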