Kernel embedding of distributions
Although learning algorithms in the kernel embedding framework circumvent the need for intermediate density estimation, one may nonetheless use the empirical embedding to perform density estimation based on ''n'' samples drawn from an underlying distribution <math>P_X^*</math>. This can be done by solving the following optimization problem <ref name ="SongThesis"/><ref>M. Dudík, S. J. Phillips, R. E. Schapire. (2007). [http://classes.soe.ucsc.edu/cmps242/Winter08/lect/15/maxent_genreg_jmlr.pdf Maximum Entropy Distribution Estimation with Generalized Regularization and an Application to Species Distribution Modeling]. ''Journal of Machine Learning Research'', '''8''': 1217–1260.</ref>
 
:<math> \max_{P_X} H(P_X) </math> subject to <math>\|\widehat{\mu}_X - \mu_X[P_X] \|_\mathcal{H} \le \varepsilon</math>
 
where the maximization is performed over the entire space of distributions on <math>\Omega.</math> Here, <math>\mu_X[P_X]</math> is the kernel embedding of the proposed density <math>P_X</math> and <math>H</math> is an entropy-like quantity (e.g. [[Entropy (information theory)|Entropy]], [[Kullback–Leibler divergence|KL divergence]], [[Bregman divergence]]). The distribution which solves this optimization may be interpreted as a compromise between fitting the empirical kernel means of the samples well and still allocating a substantial portion of the probability mass to all regions of the probability space (much of which may not be represented in the training examples). In practice, a good approximate solution of the difficult optimization may be found by restricting the space of candidate densities to a mixture of ''M'' candidate distributions with regularized mixing proportions. If one views the feature mappings associated with the kernel as sufficient statistics in generalized (possibly infinite-dimensional) [[exponential family|exponential families]], then estimating conditional probability distributions in this fashion draws connections between the ideas underlying [[Gaussian process]]es and [[conditional random fields]].<ref name = "SongThesis"/>
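The restricted mixture version of the problem above can be sketched numerically. The following is a minimal illustration, not the method of the cited references: it assumes a Gaussian RBF kernel, represents the empirical embedding and each of ''M'' candidate distributions by finite sample sets, and fits the mixing proportions on the probability simplex by exponentiated-gradient updates, with a small entropy-of-weights penalty standing in for the entropy-like regularizer. All distributions, sample sizes, and step sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, sigma=1.0):
    # Gaussian RBF kernel matrix between two 1-d sample vectors
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

# n samples from the unknown distribution P_X^* (here: a standard normal)
x = rng.normal(0.0, 1.0, size=200)

# M candidate distributions (unit-variance Gaussians at different means),
# each represented by its own sample set
means = [0.0, 3.0, -3.0]
cand = [rng.normal(m, 1.0, size=500) for m in means]

# Inner products in the RKHS, estimated from samples:
#   b[m]    ~ <mu_hat_X, mu_hat_m>_H
#   A[m,m'] ~ <mu_hat_m, mu_hat_m'>_H
b = np.array([gauss_kernel(x, z).mean() for z in cand])
A = np.array([[gauss_kernel(zi, zj).mean() for zj in cand] for zi in cand])

# Minimize ||mu_hat_X - sum_m w_m mu_hat_m||_H^2 minus lam times the entropy
# of the mixing weights, over the probability simplex, via multiplicative
# (exponentiated-gradient) updates that keep w positive and normalized.
lam, eta = 0.01, 0.5
w = np.full(len(cand), 1.0 / len(cand))
for _ in range(500):
    grad = -2 * b + 2 * A @ w + lam * (np.log(w) + 1)
    w = w * np.exp(-eta * grad)
    w /= w.sum()

print(np.round(w, 3))  # most of the mass lands on the candidate whose
                       # embedding is closest to the empirical one
```

The entropy term keeps every mixing proportion strictly positive, so no candidate region is ruled out entirely, which mirrors the compromise described above.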