Kernel embedding of distributions

In distribution regression, the goal is to regress from probability distributions to reals (or vectors). Many important [[machine learning]] and statistical tasks fit into this framework, including [[Multiple-instance learning|multi-instance learning]] and [[point estimation]] problems without analytical solution (such as [[hyperparameter]] or [[entropy estimation]]). In practice only finite samples from the underlying distributions are observable, so the estimates have to rely on similarities computed between ''sets of points''. Distribution regression has been successfully applied, for example, to supervised entropy learning and to aerosol prediction from multispectral satellite images.<ref name = "MERR">Z. Szabó, B. Sriperumbudur, B. Póczos, A. Gretton. [http://jmlr.org/papers/v17/14-510.html Learning Theory for Distribution Regression]. ''Journal of Machine Learning Research'', 17(152):1–40, 2016.</ref>
 
Given training data <math>{\left(\{X_{i,n}\}_{n=1}^{N_i}, y_i\right)}_{i=1}^{\ell}</math>, where the bag <math>\hat{X}_i := \{X_{i,n}\}_{n=1}^{N_i}</math> contains samples from a probability distribution <math>X_i</math> and the <math>i^\text{th}</math> output label is <math>y_i\in \R</math>, one can tackle the distribution regression task by taking the embeddings of the distributions and learning a regressor from the embeddings to the outputs. In other words, one can consider the following kernel [[Tikhonov regularization|ridge regression]] problem <math>(\lambda>0)</math>
 
:<math>J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} \left[f\left(\mu_{\hat{X}_i}\right)-y_i\right]^2 + \lambda \|f\|_{\mathcal{H}(K)}^2 \to \min_{f\in \mathcal{H}(K)}, </math>
 
where
:* <math>\mu_{\hat{X}_i} = \frac{1}{N_i} \sum_{n=1}^{N_i} k\left(\cdot, X_{i,n}\right)</math> is the empirical mean embedding of the bag <math>\hat{X}_i</math>, computed with a kernel <math>k</math> on the sample space,
:* <math>K</math> is a kernel on the mean-embedded distributions, and
:* <math>\mathcal{H}(K)</math> is the [[reproducing kernel Hilbert space]] determined by <math>K</math>.
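All quantities in this objective can be estimated directly from the bags, because the inner product of two empirical mean embeddings is simply the average of the base kernel <math>k</math> over all cross pairs of bag elements: <math>\langle \mu_{\hat{X}_i}, \mu_{\hat{X}_j}\rangle = \frac{1}{N_i N_j}\sum_{n,m} k(X_{i,n}, X_{j,m})</math>. The following minimal sketch illustrates this, assuming a Gaussian base kernel and the linear outer kernel <math>K(\mu,\mu') = \langle \mu,\mu'\rangle</math>; the helper names (<code>mean_embedding_inner</code>, <code>outer_gram</code>) are illustrative and not taken from any particular library.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Base kernel k(x, y) = exp(-gamma * ||x - y||^2), evaluated between two point sets.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def mean_embedding_inner(bag_i, bag_j, gamma=1.0):
    # <mu_i, mu_j> estimated from samples: average of k over all cross pairs of bag elements.
    return gaussian_kernel(bag_i, bag_j, gamma).mean()

def outer_gram(bags, gamma=1.0):
    # Gram matrix G with G_ij = K(mu_i, mu_j), here with the linear outer kernel K = <., .>.
    ell = len(bags)
    G = np.empty((ell, ell))
    for i in range(ell):
        for j in range(i, ell):
            G[i, j] = G[j, i] = mean_embedding_inner(bags[i], bags[j], gamma)
    return G
</syntaxhighlight>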
The prediction on a new distribution, represented by the bag <math>\hat{X}</math>, takes the simple analytical form
:: <math> \hat{y}\big(\hat{X}\big) = \mathbf{k} [\mathbf{G} + \lambda \ell \mathbf{I}]^{-1}\mathbf{y}, </math>
where <math>\mathbf{k}=\big[K \big(\mu_{\hat{X}_i},\mu_{\hat{X}}\big)\big]\in \R^{1\times \ell}</math>, <math>\mathbf{G}=[G_{ij}]\in \R^{\ell\times \ell}</math> with <math>G_{ij} = K\big(\mu_{\hat{X}_i},\mu_{\hat{X}_j}\big)\in \R</math>, <math>\mathbf{I}</math> is the <math>\ell\times\ell</math> identity matrix, and <math>\mathbf{y}=[y_1;\ldots;y_\ell]\in \R^{\ell}</math>. Under mild regularity conditions this estimator can be shown to be consistent, and it can achieve the one-stage sampled (as if one had access to the true <math>X_i</math>-s) [[Minimax estimator|minimax optimal]] rate.<ref name = "MERR" /> In the objective function <math>J</math> the <math>y_i</math>-s are real numbers; the results can also be extended to the case when the <math>y_i</math>-s are <math>d</math>-dimensional vectors, or more generally elements of a [[Separable space|separable]] [[Hilbert space]], using operator-valued kernels <math>K</math>.
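Continuing the sketch above (and reusing its illustrative helpers; <code>fit_predict</code> below is likewise a hypothetical name), the analytic predictor amounts to solving the regularized linear system <math>[\mathbf{G} + \lambda \ell \mathbf{I}]\boldsymbol{\alpha} = \mathbf{y}</math> and returning <math>\mathbf{k}\boldsymbol{\alpha}</math>:

<syntaxhighlight lang="python">
def fit_predict(bags, y, new_bag, lam=1e-3, gamma=1.0):
    # Kernel ridge regression on mean embeddings: y_hat = k (G + lam * ell * I)^{-1} y.
    ell = len(bags)
    G = outer_gram(bags, gamma)
    k_vec = np.array([mean_embedding_inner(b, new_bag, gamma) for b in bags])
    alpha = np.linalg.solve(G + lam * ell * np.eye(ell), y)
    return k_vec @ alpha

# Toy usage: 20 training bags drawn from 2-D Gaussians with shifted means,
# with the empirical mean of the first coordinate as regression target.
rng = np.random.default_rng(0)
bags = [rng.normal(m, 1.0, size=(50, 2)) for m in np.linspace(0.0, 3.0, 20)]
y = np.array([b[:, 0].mean() for b in bags])
print(fit_predict(bags, y, rng.normal(1.5, 1.0, size=(60, 2))))
</syntaxhighlight>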
 
== Example ==