=== Domain adaptation under covariate, target, and conditional shift ===
The goal of [[___domain adaptation]] is to formulate learning algorithms that generalize well when the training and test data are drawn from different distributions. Given training examples <math>\{(x_i^\text{tr}, y_i^\text{tr})\}_{i=1}^n</math> and a test set <math>\{(x_j^\text{te}, y_j^\text{te}) \}_{j=1}^m</math> in which the <math>y_j^\text{te}</math> are unknown, three types of differences are commonly assumed between the training distribution <math>P^\text{tr}(X,Y)</math> and the test distribution <math>P^\text{te}(X,Y)</math>:<ref name = "DA">K. Zhang, B. Schölkopf, K. Muandet, Z. Wang. (2013). [http://jmlr.org/proceedings/papers/v28/zhang13d.pdf Domain adaptation under target and conditional shift]. ''Journal of Machine Learning Research'', '''28'''(3): 819–827.</ref><ref name = "CovS">A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf. (2008). Covariate shift and local learning by distribution matching. In J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N. Lawrence (eds.), ''Dataset Shift in Machine Learning'', MIT Press, Cambridge, MA: 131–160.</ref>
# '''Covariate shift''' in which the marginal distribution of the covariates changes across domains: <math> P^\text{tr}(X) \neq P^\text{te}(X)</math>
# '''Target shift''' in which the marginal distribution of the outputs changes across domains: <math> P^\text{tr}(Y) \neq P^\text{te}(Y)</math>
# '''Conditional shift''' in which <math>P(Y)</math> remains the same across domains, but the conditional distributions differ: <math>P^\text{tr}(X \mid Y) \neq P^\text{te}(X \mid Y)</math>. In general, the presence of conditional shift leads to an [[Well-posed problem|ill-posed]] problem, and the additional assumption that <math>P(X \mid Y)</math> changes only under [[Location parameter|___location]]-[[Scale parameter|scale]] (LS) transformations on <math> X </math> is commonly imposed to make the problem tractable.
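As an illustrative toy example of the three scenarios (not taken from the cited references; the generative model, the class-conditional Gaussians, and all variable names are assumptions of the sketch), synthetic data exhibiting each type of shift can be generated as follows:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training ___domain: P(Y=1) = 0.5 and X | Y=y ~ N(2y - 1, 1).
y_tr = rng.choice([0, 1], n)
x_tr = rng.normal(2.0 * y_tr - 1.0, 1.0)

# 1. Covariate shift: P(X) moves while P(Y|X) stays fixed.  For this
#    balanced Gaussian mixture, P(Y=1 | X=x) = 1 / (1 + exp(-2x)).
x_cov = rng.normal(1.0, 1.0, n)
y_cov = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x_cov))).astype(int)

# 2. Target shift: P(Y) moves while P(X|Y) stays fixed
#    (sample Y from the new marginal, then X | Y as in training).
y_tgt = rng.choice([0, 1], n, p=[0.8, 0.2])
x_tgt = rng.normal(2.0 * y_tgt - 1.0, 1.0)

# 3. Conditional (___location-scale) shift: P(Y) stays fixed while
#    X | Y undergoes a ___location-scale change.
y_cnd = rng.choice([0, 1], n)
x_cnd = 1.5 * rng.normal(2.0 * y_cnd - 1.0, 1.0) + 0.5
</syntaxhighlight>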
By using the kernel embeddings of marginal and conditional distributions, practical approaches to deal with each of these types of differences between the training and test domains can be formulated. Covariate shift may be accounted for by reweighting training examples via estimates of the ratio <math>P^\text{te}(X)/P^\text{tr}(X)</math>, obtained directly from the kernel embeddings of the marginal distributions of <math>X</math> in each ___domain, without explicitly estimating either distribution.<ref name = "CovS"/> Target shift cannot be treated in the same way, since no samples from <math>Y</math> are available in the test ___domain; it is instead accounted for by weighting training examples using the vector <math>\boldsymbol{\beta}^*(\mathbf{y}^\text{tr})</math> that solves the following optimization problem (where in practice, empirical approximations must be used):<ref name = "DA"/>
:<math>\min_{\boldsymbol{\beta}(y)} \left \|\mathcal{C}_{{(X \mid Y)}^\text{tr}} \mathbb{E}_{Y^\text{tr}} [\boldsymbol{\beta}(y) \varphi(y)] - \mu_{X^\text{te}} \right \|_\mathcal{H}^2</math> subject to <math>\boldsymbol{\beta}(y) \ge 0, \mathbb{E}_{Y^\text{tr}} [\boldsymbol{\beta}(y)] = 1</math>
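Empirically, both reweighting problems reduce to the same form: find nonnegative weights with unit mean whose induced training embedding is closest to the empirical test embedding. The following sketch (illustrative only, not code from the cited references; the Gaussian kernel, the SLSQP solver, the regularization <code>lam</code>, and all function names are assumptions) uses the standard Gram-matrix estimate <math>\widehat{\mathcal{C}}_{(X \mid Y)^\text{tr}} = \Phi(L + n\lambda I)^{-1}\Psi^\top</math>, so that covariate shift corresponds to matching with <math>A = I</math> (kernel mean matching) and target shift to <math>A = (L + n\lambda I)^{-1}L</math>:
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B
    # (both 2-D arrays of shape (num_points, dim)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def match_weights(K, Kc, A):
    # Minimize || Phi (A beta / n) - mu_te ||^2 over beta >= 0 with
    # mean(beta) = 1, dropping the beta-independent constant term.
    n, m = K.shape[0], Kc.shape[1]
    target = Kc.sum(axis=1) / m              # kernel evaluations against mu_te
    def objective(beta):
        v = A @ beta / n
        return v @ K @ v - 2.0 * v @ target
    cons = [{"type": "eq", "fun": lambda b: b.mean() - 1.0}]
    res = minimize(objective, np.ones(n), bounds=[(0.0, None)] * n,
                   constraints=cons, method="SLSQP")
    return res.x

def covariate_shift_weights(X_tr, X_te, sigma=1.0):
    # Kernel mean matching: reweight training inputs so the weighted
    # embedding of X^tr matches the embedding of X^te (A = identity).
    K, Kc = rbf(X_tr, X_tr, sigma), rbf(X_tr, X_te, sigma)
    return match_weights(K, Kc, np.eye(len(X_tr)))

def target_shift_weights(X_tr, y_tr, X_te, sx=1.0, sy=1.0, lam=0.1):
    # Weights act on y: the reweighted embedding is C_{X|Y} (1/n) Psi beta,
    # and with C_{X|Y} estimated as Phi (L + n*lam*I)^{-1} Psi^T this
    # yields A = (L + n*lam*I)^{-1} L.
    n = len(X_tr)
    K, Kc = rbf(X_tr, X_tr, sx), rbf(X_tr, X_te, sx)
    L = rbf(y_tr, y_tr, sy)
    A = np.linalg.solve(L + n * lam * np.eye(n), L)
    return match_weights(K, Kc, A)
</syntaxhighlight>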
To deal with ___location-scale conditional shift, one can apply an LS transformation to the training points to obtain transformed training data <math> \mathbf{X}^\text{new} = \mathbf{X}^\text{tr} \odot \mathbf{W} + \mathbf{B}</math> (where <math>\odot</math> denotes the element-wise vector product). To ensure that the transformed training samples and the test data have similar distributions, <math>\mathbf{W},\mathbf{B}</math> are estimated by minimizing the following empirical kernel embedding distance:<ref name = "DA"/>
:<math>\left \| \widehat{\mu}_{X^\text{new}} - \widehat{\mu}_{X^\text{te}} \right \|_{\mathcal{H}}^2 = \left \| \widehat{\mathcal{C}}_{(X \mid Y)^\text{new}} \widehat{\mu}_{Y^\text{tr}} - \widehat{\mu}_{X^\text{te}} \right \|_{\mathcal{H}}^2 </math>
In general, the kernel embedding methods for dealing with LS conditional shift and target shift may be combined to find a reweighted transformation of the training data which mimics the test distribution, and these methods may perform well even in the presence of conditional shifts other than ___location-scale changes.<ref name = "DA"/>
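Under the same assumptions as above, the two steps could be combined along these lines (reusing the hypothetical <code>fit_ls_transform</code> and <code>target_shift_weights</code> helpers from the previous sketches):
<syntaxhighlight lang="python">
import numpy as np

# Toy data: training ___domain as before; the test ___domain has both a changed
# P(Y) and a ___location-scale change of X | Y.
rng = np.random.default_rng(1)
y_tr = rng.choice([0.0, 1.0], size=(200, 1))
X_tr = rng.normal(2.0 * y_tr - 1.0, 1.0)
y_te = rng.choice([0.0, 1.0], size=(300, 1), p=[0.7, 0.3])
X_te = 1.5 * rng.normal(2.0 * y_te - 1.0, 1.0) + 0.5

# First transform the training inputs, then reweight them toward the test
# embedding; a weighted learner is then trained on (X_new, y_tr, beta).
W, B = fit_ls_transform(X_tr, X_te)
X_new = X_tr * W + B
beta = target_shift_weights(X_new, y_tr, X_te)
</syntaxhighlight>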