=== Domain generalization via invariant feature representation ===
Given ''N'' sets of training examples sampled i.i.d. from distributions <math>P^{(1)}(X,Y), P^{(2)}(X,Y), \ldots, P^{(N)}(X,Y)</math>, the goal of '''___domain generalization''' is to formulate learning algorithms which perform well on test examples sampled from a previously unseen ___domain <math>P^*(X,Y)</math> where no data from the test ___domain is available at training time. If conditional distributions <math>P(Y \mid X)</math> are assumed to be relatively similar across all domains, then a learner capable of ___domain generalization must estimate a functional relationship between the variables which is robust to changes in the marginals <math>P(X)</math>. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions while preserving a common conditional distribution shared between all training domains.<ref name = "DICA">K. Muandet, D. Balduzzi, B. Schölkopf. (2013). [http://jmlr.org/proceedings/papers/v28/muandet13.pdf Domain Generalization Via Invariant Feature Representation]. ''30th International Conference on Machine Learning''.</ref> DICA thus extracts ''invariants'', features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as [[kernel principal component analysis]], transfer component analysis, and covariance operator inverse regression.<ref name = "DICA"/>
Defining a probability distribution <math>\mathcal{P}</math> on the RKHS <math>\mathcal{H}</math> with

:<math>\mathcal{P} \left (\mu_{X^{(i)}Y^{(i)}} \right ) = \frac{1}{N} \qquad \text{for } i = 1, \dots, N,</math>

DICA measures dissimilarity between domains via '''distributional variance''' which is computed as
 
:<math>V_\mathcal{H} (\mathcal{P}) = \frac{1}{N} \text{tr}(\mathbf{G}) - \frac{1}{N^2} \sum_{i,j=1}^N \mathbf{G}_{ij} </math>
 
where
 
:<math>\mathbf{G}_{ij} = \left \langle \mu_{X^{(i)}}, \mu_{X^{(j)}} \right \rangle_\mathcal{H} </math>
 
so <math>\mathbf{G}</math> is an <math>N \times N</math> Gram matrix over the distributions from which the training data are sampled. By finding an [[Orthogonal matrix|orthogonal transform]] onto a low-dimensional [[Linear subspace|subspace]] ''B'' (in the feature space) which minimizes the distributional variance, DICA simultaneously ensures that ''B'' aligns with the [[Basis function|bases]] of a '''central subspace''' ''C'' for which <math>Y</math> becomes independent of <math>X</math> given <math>C^T X</math> across all domains. In the absence of target values <math>Y</math>, an unsupervised version of DICA may be formulated which finds a low-dimensional subspace that minimizes distributional variance while simultaneously maximizing the variance of <math>X</math> (in the feature space) across all domains (rather than preserving a central subspace).<ref name = "DICA"/>
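In practice each mean embedding is replaced by its empirical counterpart, so the entries of <math>\mathbf{G}</math> can be estimated from samples as <math>\hat{\mathbf{G}}_{ij} = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k\left(x_k^{(i)}, x_l^{(j)}\right)</math>, where <math>n_i</math> is the number of samples drawn from ___domain <math>i</math>. The sketch below illustrates this estimate of the distributional variance only; the function names, the choice of a Gaussian kernel, and the synthetic data are illustrative assumptions rather than part of the DICA reference.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def distributional_variance(domains, sigma=1.0):
    # domains: list of (n_i, d) sample arrays, one per training ___domain.
    # G[i, j] estimates <mu_{X^(i)}, mu_{X^(j)}>_H by averaging the kernel
    # over all pairs of samples drawn from domains i and j.
    N = len(domains)
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            G[i, j] = gaussian_kernel(domains[i], domains[j], sigma).mean()
    # V_H(P) = tr(G)/N - sum_{i,j} G_{ij} / N^2
    return np.trace(G) / N - G.sum() / N ** 2

# Example: three domains whose marginals differ only by a mean shift.
rng = np.random.default_rng(0)
domains = [rng.normal(loc=shift, scale=1.0, size=(100, 2))
           for shift in (0.0, 0.5, 1.0)]
print(distributional_variance(domains, sigma=1.0))
</syntaxhighlight>

Since the distributional variance equals the average squared distance of the mean embeddings from their centroid, it is non-negative and vanishes exactly when all (empirical) mean embeddings coincide, which is the condition DICA pushes toward after projecting the data onto ''B''.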