Importance sampling

{{short description|Distribution estimation technique}}
'''Importance sampling''' is a [[Monte Carlo method]] for evaluating properties of a particular [[probability distribution|distribution]], while only having samples generated from a different distribution than the distribution of interest. Its introduction in statistics is generally attributed to a paper by [[Teun Kloek]] and [[Herman K. van Dijk]] in 1978,<ref>{{cite journal |first1=T. |last1=Kloek |first2=H. K. |last2=van Dijk |title=Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo |journal=[[Econometrica]] |volume=46 |issue=1 |year=1978 |pages=1–19 |doi=10.2307/1913641 |jstor=1913641 |url=https://ageconsearch.umn.edu/record/272139/files/erasmus076.pdf }}</ref> but its precursors can be found in [[Monte Carlo method in statistical physics|statistical physics]] as early as 1949.<ref>{{cite journal |first=G. |last=Goertzel |authorlink=Gerald Goertzel |title=Quota Sampling and Importance Functions in Stochastic Solution of Particle Problems |journal=Technical Report ORNL-434, Oak Ridge National Laboratory |series=Aecd ; 2793 |year=1949 |hdl=2027/mdp.39015086443671 }}</ref><ref>{{cite journal |last1=Kahn |first1=H. |authorlink=Herman Kahn |last2=Harris |first2=T. E. |authorlink2=Theodore E. Harris |year=1949 |title=Estimation of Particle Transmission by Random Sampling |journal=Monte Carlo Method |volume=12 |series=Applied Mathematics Series |pages=27–30 |publisher=National Bureau of Standards. }}</ref> Importance sampling is also related to [[umbrella sampling]] in [[computational physics]]. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.
 
== Basic theory ==
<!-- <math>\mathbb{E}_\mathbb{P}[X]</math> denotes the expectation of ''X'' under <math>\mathbb{P}</math>. -->
 
Let <math>X\colon \Omega\to \mathbb{R}</math> be a [[random variable]] in some [[probability space]] <math>(\Omega,\mathcal{F},\mathbb{P})</math>. We wish to estimate the [[expected value]] of <math>X</math> under <math>\mathbb{P}</math>, denoted <math>\mathbb{E}_\mathbb{P}[X]</math>. If we have statistically independent random samples <math>X_1, \ldots, X_n</math>, generated according to <math>\mathbb{P}</math>, then an empirical estimate of <math>\mathbb{E}_{\mathbb{P}}[X]</math> is just
 
: <math>
\widehat{\mathbb{E}}_{n,\mathbb{P}}[X] = \frac{1}{n} \sum_{i=1}^n X_i \quad \mathrm{where}\; X_i \sim \mathbb{P}
</math>
 
and the precision of this estimate depends on the variance of <math>X</math>:
 
: <math>
\operatorname{var}_\mathbb{P}\big[\widehat{\mathbb{E}}_{n,\mathbb{P}}[X]\big] = \frac{\operatorname{var}_\mathbb{P}[X]}{n}.
</math>
 
The basic idea of importance sampling is to draw the samples from a different distribution, either to lower the variance of the estimate of <math>\mathbb{E}_\mathbb{P}[X]</math>, or because sampling directly from <math>\mathbb{P}</math> is difficult.
 
This is accomplished by first choosing a random variable <math>Y\geq 0</math> such that <math>\mathbb{E}_\mathbb{P}[Y] = 1</math> and that <math>\mathbb{P}</math>-[[almost everywhere]] <math>Y(\omega)\neq 0</math>.
With the variable <math>Y</math> we define a probability measure <math>\mathbb{Q}</math> by <math>d\mathbb{Q} = Y\,d\mathbb{P}</math>, which satisfies
: <math>
\mathbb{E}_{\mathbb{P}}[X] = \mathbb{E}_{\mathbb{Q}}\left[\frac{X}{Y}\right].
</math>
 
The variable <math>X/Y</math> will thus be sampled under <math>\mathbb{Q}</math> to estimate <math>\mathbb{E}_{\mathbb{P}}[X]</math> as above, and the estimate is improved when
<math>\operatorname{var}_{\mathbb{Q}}\left[\frac{X}{Y}\right] < \operatorname{var}_{\mathbb{P}}[X]</math>.
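
For illustration, the following minimal sketch (not part of the formal development above) compares the plain estimate under <math>\mathbb{P}</math> with the importance sampling estimate under <math>\mathbb{Q}</math>. All concrete choices are assumptions made only for the example: <math>\mathbb{P}</math> is a standard normal distribution, <math>X(\omega)=\omega^2</math> (so <math>\mathbb{E}_{\mathbb{P}}[X]=1</math>), <math>\mathbb{Q}</math> is a wider normal distribution, and <math>Y = d\mathbb{Q}/d\mathbb{P}</math> is evaluated from the two densities.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Illustrative choices: P = N(0, 1) and X(w) = w**2, so E_P[X] = 1.
p = stats.norm(0.0, 1.0)
def X(w):
    return w ** 2

# Plain Monte Carlo: sample directly from P and average X.
w_p = p.rvs(size=n, random_state=rng)
plain_estimate = X(w_p).mean()

# Importance sampling: sample from a wider proposal Q = N(0, 2) and
# weight each sample by 1/Y = dP/dQ, the ratio of the two densities.
q = stats.norm(0.0, 2.0)
w_q = q.rvs(size=n, random_state=rng)
weights = p.pdf(w_q) / q.pdf(w_q)
is_estimate = (X(w_q) * weights).mean()

print(plain_estimate, is_estimate)  # both are close to E_P[X] = 1
</syntaxhighlight>

Whether the importance sampling estimate actually has lower variance depends on the choice of <math>\mathbb{Q}</math>; the wider proposal here merely illustrates the mechanics.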
 
When <math>X</math> is of constant sign over <math>\Omega</math>, the best variable <math>Y</math> would clearly be <math>Y^*=\frac{X}{\mathbb{E}_{\mathbb{P}}[X]}\geq 0</math>, so that <math>X/Y^*</math> is the constant <math>\mathbb{E}_{\mathbb{P}}[X]</math> and a single sample under <math>\mathbb{Q}^*</math> suffices to give its value. Unfortunately we cannot make that choice, because <math>\mathbb{E}_{\mathbb{P}}[X]</math> is precisely the value we are trying to estimate! However, this theoretical best case <math>Y^*</math> gives us an insight into what importance sampling does: for all <math>x \in \mathbb{R}</math>, the probability that <math>X</math> falls in an infinitesimal interval <math>[x;x+dx]</math> under <math>\mathbb{Q}^*</math> can be written as
: <math>\begin{align}
\mathbb{Q}^*\big(X\in[x;x+dx]\big) &= \int_{\omega\in\{X\in[x;x+dx]\}} \frac{X(\omega)}{\mathbb{E}_{\mathbb{P}}[X]} \, d\mathbb{P}(\omega) \\[6pt] &= \frac{1}{\mathbb{E}_{\mathbb{P}}[X]}\; x\,\mathbb{P}(X\in[x;x+dx]) .
\end{align}
</math>
On the right-hand side, <math>x\,\mathbb{P}(X\in[x;x+dx])</math> is one of the infinitesimal elements that sum up to <math>\mathbb{E}_{\mathbb{P}}[X]</math>:
 
: <math>\mathbb{E}_{\mathbb{P}}[X] = \int_{x=-\infty}^{+\infty} x\,\mathbb{P}(X\in[x;x+dx]) ;</math>
therefore, a good probability change <math>\mathbb{Q}</math> in importance sampling will redistribute the law of <math>X</math> so that its samples' frequencies are sorted directly according to their contributions in <math>\mathbb{E}_{\mathbb{P}}[X]</math>, as opposed to <math>\mathbb{E}_{\mathbb{P}}[1]</math>. Hence the name "importance sampling."
 
Importance sampling is often used as a [[Monte Carlo integration|Monte Carlo integrator]].
When <math>\mathbb{P}</math> is the uniform distribution over <math>\Omega =\mathbb{R}</math>, the expectation <math>\mathbb{E}_{\mathbb{P}}[X]</math> corresponds to the integral of the real function <math>X\colon \mathbb{R}\to\mathbb{R}</math>.
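
Viewed this way, a hedged sketch of importance sampling as an integrator (the integrand, the Student-t proposal, and the sample size below are illustrative assumptions) estimates <math>\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}</math> by drawing from a proposal density <math>q</math> and averaging the integrand divided by <math>q</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

# Integrand whose integral over the real line is sqrt(pi).
def integrand(x):
    return np.exp(-x ** 2)

# Proposal density q: a Student-t distribution, whose heavier tails
# cover the integrand everywhere it is non-negligible.
q = stats.t(df=3)
x = q.rvs(size=n, random_state=rng)

estimate = np.mean(integrand(x) / q.pdf(x))
print(estimate, np.sqrt(np.pi))  # the two values should agree closely
</syntaxhighlight>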
 
== Application to probabilistic inference ==
 
Such methods are frequently used to estimate posterior densities or expectations in state and/or parameter estimation problems in probabilistic models that are too hard to treat analytically. Examples include [[Bayesian network]]s and importance weighted [[variational autoencoder]]s.<ref>{{Cite journal |last1=Burda |first1=Yuri |last2=Grosse |first2=Roger |last3=Salakhutdinov |first3=Ruslan |title=Importance Weighted Autoencoders |url=https://arxiv.org/abs/1509.00519 |journal=Proceedings of the 4th International Conference on Learning Representations (ICLR) |arxiv=1509.00519 |publication-date=2016}}</ref>
 
== Application to simulation ==
=== Mathematical approach ===
 
Consider estimating by simulation the probability <math>p_t\,</math> of an event <math>X \ge t</math>, where <math>X</math> is a random variable with [[cumulative distribution function]] <math>F(x)</math> and [[probability density function]] <math>f(x)= F'(x)\,</math>, where the prime denotes the [[derivative]]. A <math>K</math>-length [[independent and identically distributed]] (i.i.d.) sequence <math>X_i\,</math> is generated from the distribution <math>F</math>, and the number <math>k_t</math> of random variables that lie above the threshold <math>t</math> is counted. The random variable <math>k_t</math> is characterized by the [[binomial distribution]]
 
:<math>P(k_t = k)={K\choose k}p_t^k(1-p_t)^{K-k},\,\quad \quad k=0,1,\dots,K.</math>
 
One can show that <math>\mathbb{E} [k_t/K] = p_t</math>, and <math>\operatorname{var} [k_t/K] = p_t(1-p_t)/K</math>, so in the limit <math>K \to \infty</math> we are able to obtain <math>p_t</math>. Note that the variance is low if <math>p_t \approx 1</math>. Importance sampling is concerned with the determination and use of an alternate density function <math>f_*\,</math> (for <math>X</math>), usually referred to as a biasing density, for the simulation experiment. This density allows the event <math>\{ X \ge t \}</math> to occur more frequently, so the sequence length <math>K</math> gets smaller for a given [[estimator]] variance. Alternatively, for a given <math>K</math>, use of the biasing density results in a variance smaller than that of the conventional Monte Carlo estimate. From the definition of <math>p_t\,</math>, we can introduce <math>f_*\,</math> as below.
 
:<math>
\begin{align}
p_t & = \mathbb{E} [1_{\{X \ge t\}}] \\[6pt]
& = \int 1_{\{x \ge t\}} \frac{f(x)}{f_*(x)} f_*(x) \,dx \\[6pt]
& = \mathbb{E}_* [1_{\{X \ge t\}} W(X)]
\end{align}
</math>
is a likelihood ratio and is referred to as the weighting function. The last equality in the above equation motivates the estimator
 
:<math> \hat p_t = \frac{1}{K}\,\sum_{i=1}^K 1_{\{X_i \ge t\}} W(X_i),\quad X_i \sim f_*</math>
 
This is the importance sampling estimator of <math>p_t\,</math> and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from <math>f_*\,</math> and for each sample which exceeds <math>t\,</math>, the estimate is incremented by the weight <math>W\,</math> evaluated at the sample value. The results are averaged over <math>K\,</math> trials. The variance of the importance sampling estimator is easily shown to be
:<math>
\begin{align}
\operatorname{var}_*\widehat p_t & = \frac{1}{K}\operatorname{var}_* [1_{\{X \ge t\}}W(X)] \\[5pt]
& = \frac{1}{K}\left\{\mathbb{E}_*[1_{\{X \ge t\}}^2 W^2(X)] - p_t^2\right\} \\[5pt]
& = \frac{1}{K}\left\{\mathbb{E}[1_{\{X \ge t\}}W(X)] - p_t^2\right\}
\end{align}
</math>
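
A minimal sketch of this estimator and its empirical variance follows; the standard normal nominal density, the threshold <math>t=4</math>, and the mean-shifted normal biasing density <math>f_*</math> are assumptions chosen only for illustration, not prescribed by the theory above.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K = 100_000
t = 4.0

f = stats.norm(0.0, 1.0)         # nominal density f
f_star = stats.norm(t, 1.0)      # biasing density f_*, centred at the threshold

x = f_star.rvs(size=K, random_state=rng)
W = f.pdf(x) / f_star.pdf(x)     # likelihood ratio W(X) = f(X) / f_*(X)
terms = (x >= t) * W             # 1_{X >= t} W(X) for each sample

p_hat = terms.mean()             # importance sampling estimate of p_t
var_hat = terms.var(ddof=1) / K  # empirical variance of the estimator

print(p_hat, f.sf(t))            # compare with the exact tail probability
print(var_hat)
</syntaxhighlight>

For a rare event such as this one (<math>p_t \approx 3\times 10^{-5}</math>), a plain Monte Carlo run of the same length would typically observe only a handful of exceedances, whereas every biased sample above contributes a weighted term; this is the practical motivation for the biasing density.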
=== Evaluation of importance sampling ===
 
In order to identify successful importance sampling techniques, it is useful to be able to quantify the run-time savings due to the use of the importance sampling approach. The performance measure commonly used is <math>\sigma^2_{MC} / \sigma^2_{IS} \,</math>, and this can be interpreted as the speed-up factor by which the importance sampling estimator achieves the same precision as the MC estimator. This has to be computed empirically, since the estimator variances are unlikely to be analytically tractable when their mean is intractable. Other useful concepts in quantifying an importance sampling estimator are the variance bounds and the notion of asymptotic efficiency. One related measure is the so-called '''effective sample size''' ('''ESS''').<ref>{{Cite journal|last1=Martino|first1=Luca|last2=Elvira|first2=Víctor|last3=Louzada|first3=Francisco|title=Effective sample size for importance sampling based on discrepancy measures|journal=Signal Processing|volume=131|pages=386–401|doi=10.1016/j.sigpro.2016.08.025|arxiv=1602.03572|year=2017|bibcode=2017SigPr.131..386M |s2cid=26317735 }}</ref>
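
For a given set of importance weights <math>w_i</math>, the ESS is commonly estimated by the rule of thumb <math>(\sum_i w_i)^2 / \sum_i w_i^2</math>; the cited work discusses alternative, discrepancy-based measures. A small sketch of this rule of thumb, with hypothetical weights:

<syntaxhighlight lang="python">
import numpy as np

def effective_sample_size(weights):
    """Rule-of-thumb ESS: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Hypothetical weights: nearly uniform weights give an ESS close to the
# number of samples, while one dominant weight collapses the ESS towards 1.
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))    # 4.0
print(effective_sample_size([100.0, 0.1, 0.1, 0.1]))  # about 1.0
</syntaxhighlight>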
 
=== Variance cost function ===
 
===Multiple and adaptive importance sampling ===
When different proposal distributions, <math>g_i(x)</math>, <math>i=1,\ldots,n,</math> are jointly used for drawing the samples <math>x_1, \ldots, x_n,</math> different proper weighting functions can be employed (e.g., see <ref>{{Cite book|publisher = ACM|date = 1995-01-01|___location = New York, NY, USA|isbn = 978-0-89791-701-8|pages = [https://archive.org/details/computergraphics00sigg/page/419 419–428]|series = SIGGRAPH '95|doi = 10.1145/218380.218498|first1 = Eric|last1 = Veach|first2 = Leonidas J.|last2 = Guibas| title=Proceedings of the 22nd annual conference on Computer graphics and interactive techniques - SIGGRAPH '95 | chapter=Optimally combining sampling techniques for Monte Carlo rendering |citeseerx = 10.1.1.127.8105| s2cid=207194026 |chapter-url = https://archive.org/details/computergraphics00sigg/page/419}}</ref><ref>{{Cite journal|title = Safe and Effective Importance Sampling|journal = Journal of the American Statistical Association|date = 2000-03-01|issn = 0162-1459|pages = 135–143|volume = 95|issue = 449|doi = 10.1080/01621459.2000.10473909|first1 = Art|last1 = Owen|first2 = Yi|last2 = Zhou|citeseerx = 10.1.1.36.4536| s2cid=119761472 }}</ref><ref>{{Cite journal|title = Efficient Multiple Importance Sampling Estimators|journal = IEEE Signal Processing Letters|date = 2015-10-01|issn = 1070-9908|pages = 1757–1761|volume = 22|issue = 10|doi = 10.1109/LSP.2015.2432078|first1 = V.|last1 = Elvira|first2 = L.|last2 = Martino|first3 = D.|last3 = Luengo|first4 = M.F.|last4 = Bugallo|arxiv = 1505.05391|bibcode = 2015ISPL...22.1757E| s2cid=14504598 }}</ref><ref>{{Cite journal|last1=Elvira|first1=Víctor|last2=Martino|first2=Luca|last3=Luengo|first3=David|last4=Bugallo|first4=Mónica F.|title=Improving population Monte Carlo: Alternative weighting and resampling schemes|journal=Signal Processing|volume=131|pages=77–91|doi=10.1016/j.sigpro.2016.07.012|arxiv=1607.02758|year=2017|bibcode=2017SigPr.131...77E |s2cid=205171823 }}</ref>). In an adaptive setting, the proposal distributions, <math>g_{i,t}(x)</math>, <math>i=1,\ldots,n,</math> and <math>t=1,\ldots,T,</math> are updated at each iteration <math>t</math> of the adaptive importance sampling algorithm. Hence, since a population of proposal densities is used, several suitable combinations of sampling and weighting schemes can be employed.<ref>{{Cite journal|title = Population Monte Carlo|journal = Journal of Computational and Graphical Statistics|date = 2004-12-01|issn = 1061-8600|pages = 907–929|volume = 13|issue = 4|doi = 10.1198/106186004X12803|first1 = O.|last1 = Cappé|first2 = A.|last2 = Guillin|first3 = J. M.|last3 = Marin|first4 = C. 
P.|last4 = Robert| s2cid=119690181 }}</ref><ref>{{Cite journal|last1=Martino|first1=L.|last2=Elvira|first2=V.|last3=Luengo|first3=D.|last4=Corander|first4=J.|date=2017-05-01|title=Layered adaptive importance sampling|journal=Statistics and Computing|language=en|volume=27|issue=3|pages=599–623|doi=10.1007/s11222-016-9642-5|issn=0960-3174|arxiv=1505.04732|s2cid=2508031 }}</ref><ref>{{Cite journal|title = Adaptive importance sampling in general mixture classes|journal = Statistics and Computing|date = 2008-04-25|issn = 0960-3174|pages = 447–459|volume = 18|issue = 4|doi = 10.1007/s11222-008-9059-x|first1 = Olivier|last1 = Cappé|first2 = Randal|last2 = Douc|first3 = Arnaud|last3 = Guillin|first4 = Jean-Michel|last4 = Marin|first5 = Christian P.|last5 = Robert|arxiv = 0710.4242| s2cid=483916 }}</ref><ref>{{Cite journal|title = Adaptive Multiple Importance Sampling|journal = Scandinavian Journal of Statistics|date = 2012-12-01|issn = 1467-9469|pages = 798–812|volume = 39|issue = 4|doi = 10.1111/j.1467-9469.2011.00756.x|first1 = Jean-Marie|last1 = Cornuet|first2 = Jean-Michel|last2 = Marin|first3 = Antonietta|last3 = Mira|author3-link=Antonietta Mira|first4 = Christian P.|last4 = Robert|arxiv = 0907.1254| s2cid=17191248 }}</ref><ref>{{Cite journal|title = An Adaptive Population Importance Sampler: Learning From Uncertainty|journal = IEEE Transactions on Signal Processing|date = 2015-08-01|issn = 1053-587X|pages = 4422–4437|volume = 63|issue = 16|doi = 10.1109/TSP.2015.2440215|first1 = L.|last1 = Martino|first2 = V.|last2 = Elvira|first3 = D.|last3 = Luengo|first4 = J.|last4 = Corander|bibcode = 2015ITSP...63.4422M|citeseerx = 10.1.1.464.9395| s2cid=17017431 }}</ref><ref>{{Cite journal|title = Adaptive importance sampling in signal processing|journal = Digital Signal Processing|date = 2015-12-01|pages = 36–49|volume = 47|series = Special Issue in Honour of William J. (Bill) Fitzgerald|doi = 10.1016/j.dsp.2015.05.014|first1 = Mónica F.|last1 = Bugallo|first2 = Luca|last2 = Martino|first3 = Jukka|last3 = Corander| bibcode=2015DSP....47...36B |doi-access = free}}</ref><ref>{{Cite journal|last1=Bugallo|first1=M. F.|last2=Elvira|first2=V.|last3=Martino|first3=L.|last4=Luengo|first4=D.|last5=Miguez|first5=J.|last6=Djuric|first6=P. M.|date=July 2017|title=Adaptive Importance Sampling: The past, the present, and the future|journal=IEEE Signal Processing Magazine|volume=34|issue=4|pages=60–79|doi=10.1109/msp.2017.2699226|issn=1053-5888|bibcode=2017ISPM...34...60B|s2cid=5619054 }}</ref>
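
One commonly used proper weighting scheme is the deterministic mixture (balance heuristic) weight, in which a sample drawn from any one proposal is weighted by the target density divided by the mixture of all proposal densities. The sketch below is a simplified illustration of that scheme; the target, the two normal proposals, and the self-normalized estimate of <math>\mathbb{E}[X^2]</math> are assumptions made only for the example.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
target = stats.norm(0.0, 1.0)                              # illustrative target density
proposals = [stats.norm(-2.0, 1.5), stats.norm(2.0, 1.5)]  # illustrative proposals
m = 5_000                                                  # draws per proposal

samples, weights = [], []
for q in proposals:
    x = q.rvs(size=m, random_state=rng)
    # Deterministic mixture weight: target density over the equal-weight
    # mixture of all proposal densities, evaluated at each sample.
    mixture = np.mean([qi.pdf(x) for qi in proposals], axis=0)
    samples.append(x)
    weights.append(target.pdf(x) / mixture)

samples = np.concatenate(samples)
weights = np.concatenate(weights)

# Self-normalized estimate of E[X^2] under the target (true value is 1).
print(np.sum(weights * samples ** 2) / np.sum(weights))
</syntaxhighlight>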
 
== See also ==
*{{cite book |title=Sequential Monte Carlo Methods in Practice |first1=A. |last1=Doucet |first2=N. |last2=de Freitas |first3=N. |last3=Gordon |publisher=Springer |year=2001 |isbn=978-0-387-95146-1 }}
*{{cite book |first1=M. |last1=Ferrari |first2=S. |last2=Bellini |title=ICC 2001. IEEE International Conference on Communications. Conference Record (Cat. No.01CH37240) |chapter=Importance sampling simulation of turbo product codes |volume=9 |pages=2773–2777 |year=2001 |doi=10.1109/ICC.2001.936655 |isbn=978-0-7803-7097-5 |s2cid=5158473 }}
*{{cite journal |first=Oleg |last=Mazonka |title=Easy as Pi: The Importance Sampling Method |journal=Journal of Reference |volume=16 |year=2016 |url=https://www.researchgate.net/publication/303859967}}
*{{cite book |first=Tommy |last=Oberg |title=Modulation, Detection, and Coding |publisher=John Wiley & Sons |___location=New York |year=2001 }}
*{{Cite book | last1=Press | first1=WH | last2=Teukolsky | first2=SA | last3=Vetterling | first3=WT | last4=Flannery | first4=BP | year=2007 | title=Numerical Recipes: The Art of Scientific Computing | edition=3rd | publisher=Cambridge University Press | ___location=New York | isbn=978-0-521-88068-8 | chapter=Section 7.9.1 Importance Sampling | chapter-url=http://apps.nrbook.com/empanel/index.html#pg=411 | access-date=2011-08-12 | archive-date=2011-08-11 | archive-url=https://web.archive.org/web/20110811154417/http://apps.nrbook.com/empanel/index.html#pg=411 | url-status=dead }}
*{{cite book |first=B. D. |last=Ripley |title=Stochastic Simulation |year=1987 |publisher=Wiley & Sons }}
*{{cite journal |first1=P. J. |last1=Smith |first2=M. |last2=Shafi |first3=H. |last3=Gao |title=Quick simulation: A review of importance sampling techniques in communication systems |journal= IEEE Journal on Selected Areas in Communications |volume=15 |issue=4 |pages=597–613 |year=1997 |doi=10.1109/49.585771 }}