Content deleted Content added
InfCompact (talk | contribs) m →Use in numerical integration: Can't integrate arbitrary functions, rather only measurable functions. |
Citation bot (talk | contribs) Added article-number. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Abductive | Category:Monte Carlo methods | #UCB_Category 35/65 |
||
(86 intermediate revisions by 55 users not shown) | |||
Line 1:
{{short description|Monte Carlo algorithm}}
[[File:Flowchart-of-Metropolis-Hastings-M-H-algorithm-for-the-parameter-estimation-using-the.png|thumb|300px|The Metropolis-Hastings algorithm sampling a [[Normal distribution|normal]] one-dimensional [[Posterior probability|posterior]] probability distribution.]]
In [[statistics]] and [[statistical physics]], the '''Metropolis–Hastings algorithm''' is a [[Markov chain Monte Carlo]] (MCMC) method for obtaining a sequence of [[pseudo-random number sampling|random samples]] from a [[probability distribution]] from which direct sampling is difficult.
Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. ==History==
The algorithm
Some controversy exists with regard to credit for development of the Metropolis algorithm.
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 article worked together for "days (and nights)".<ref name=Teller/> In contrast, the detailed account by Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of [[statistical mechanics]] and take ensemble averages instead of following detailed [[kinematics]]".
==
The Metropolis–Hastings algorithm can draw samples from any [[probability distribution]] with [[probability density]] <math>P(x)</math>, provided that we know a function <math>f(x)</math> proportional to the [[Probability density function|density]]
The Metropolis–Hastings algorithm
The method used to propose new candidates is characterized by the probability distribution <math>g(x\mid y)</math> (sometimes written <math>Q(x\mid y)</math>) of a new proposed sample <math>x</math> given the previous sample <math>y</math>. This is called the ''proposal density'', ''proposal function'', or ''jumping distribution''. A common choice for <math>g(x\mid y)</math> is a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited next, making the sequence of samples into a [[Gaussian random walk]]. In the original paper by Metropolis et al. (1953), <math>g(x\mid y)</math> was suggested to be a uniform distribution limited to some maximum distance from <math>y</math>. More complicated proposal functions are also possible, such as those of [[Hamiltonian Monte Carlo]], [[Langevin Monte Carlo]], or [[preconditioned Crank–Nicolson]].
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.
<!---The sample values are linked in a [[Markov chain]], which means that the probability of each sample is conditionally independent of any earlier sample, given the sample immediately before it. In other words,
general idea is to generate a sequence of samples which are linked in a [[Markov chain]]; in other words, where each sample in the sequence is conditionally independent of any earlier sample, given the sample immediately before it. The procedure for choosing successive samples guarantees that the distribution of sample values will match the desired distribution
'''Metropolis algorithm (symmetric proposal distribution)'''▼
Let <math>f(x)</math> be a function that is proportional to the desired probability density function <math>P(x)</math> (a.k.a. a target distribution){{efn|In the original paper by Metropolis et al. (1953), <math>f</math> was taken to be the [[Boltzmann distribution]] as the specific application considered was [[Monte Carlo integration]] of [[equation of state|equations of state]] in [[physical chemistry]]; the extension by Hastings generalized to an arbitrary distribution <math>f</math>.}}.
# Initialization: Choose an arbitrary point <math>x_t</math> to be the first observation in the sample and choose a proposal function <math>g(x\mid y)</math>. In this section, <math>g</math> is assumed to be symmetric; in other words, it must satisfy <math>g(x\mid y) = g(y\mid x)</math>.
# For each iteration ''t'':
#* ''
#* ''Calculate'' the ''acceptance ratio'' <math>\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate{{efn|In the original paper by Metropolis et al. (1953), <math>f</math> was actually the [[Boltzmann distribution]], as it was applied to physical systems in the context of [[statistical mechanics]] (e.g., a maximal-entropy distribution of microstates for a given temperature at thermal equilibrium). Consequently, the acceptance ratio was itself an exponential of the difference in the parameters of the numerator and denominator of this ratio.}}. Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>.
#* ''Accept or reject'':
#** Generate a uniform random number <math>u \in [0, 1]</math>.
Line 35:
#** If <math>u > \alpha</math>, then ''reject'' the candidate and set <math>x_{t+1} = x_t</math> instead.
This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. <math>P(x)</math> at specific point <math>x</math> is proportional to the iterations spent on the point by the algorithm. Note that the acceptance ratio <math>\alpha</math> indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is <math>P(x)</math>.
Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal |
* The samples are
* Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density.
On the other hand, most simple [[rejection sampling]] methods suffer from the "[[curse of dimensionality]]", where the probability of rejection increases exponentially as a function of the number of dimensions.
In [[multivariate distribution|multivariate]] distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point.
==Formal derivation==
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution <math>P(x)</math>. To accomplish this, the algorithm uses a [[Markov process]], which asymptotically reaches a unique [[Markov chain#Steady-state analysis and limiting distributions|stationary distribution]] <math>\pi(x)</math> such that <math>\pi(x) = P(x)</math>.<ref name=Roberts_Casella/>
A Markov process is uniquely defined by its transition probabilities <math>P(x'
# ''Existence of stationary distribution'': there must exist a stationary distribution <math>\pi(x)</math>. A sufficient but not necessary condition is [[
# ''Uniqueness of stationary distribution'': the stationary distribution <math>\pi(x)</math> must be unique. This is guaranteed by [[Markov Chain#Ergodicity|ergodicity]] of the Markov process, which requires that every state must (1) be aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive recurrent—the expected number of steps for returning to the same state is finite.
The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition probabilities) that fulfills the two above conditions, such that its stationary distribution <math>\pi(x)</math> is chosen to be <math>P(x)</math>. The derivation of the algorithm starts with the condition of [[detailed balance]]:
: <math>P(x'
which is re-written as
: <math>\frac{P(x'
The approach is to separate the transition in two sub-steps; the proposal and the acceptance-rejection. The
: <math>P(x'
Inserting this relation in the previous equation, we have
: <math>\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\frac{g(x
The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:
: <math>A(x', x) = \min\left(1, \frac{P(x')}{P(x)} \frac{g(x
For this Metropolis acceptance ratio <math>A</math>, either <math>A(x', x) = 1</math> or <math>A(x, x') = 1</math> and, either way, the condition is satisfied.
The Metropolis–Hastings algorithm can thus
# Initialise
## Pick an initial state <math>x_0</math>.
## Set <math>t = 0</math>.
# Iterate
##
## ''Calculate'' the acceptance probability <math>A(x', x_t) = \min\left(1, \frac{P(x')}{P(x_t)} \frac{g(x_t
## ''Accept or reject'':
### generate a uniform random number <math>u \in [0, 1]</math>;
Line 87:
## ''Increment'': set <math>t = t + 1</math>.
Provided that specified conditions are met, the empirical distribution of saved states <math>x_0, \ldots, x_T</math> will approach <math>P(x)</math>. The number of iterations (<math>T</math>) required to effectively estimate <math>P(x)</math> depends on the number of factors, including the relationship between <math>P(x)</math> and the proposal distribution and the desired accuracy of estimation.<ref>Raftery, Adrian E., and Steven Lewis. "How Many Iterations in the Gibbs Sampler?" ''In Bayesian Statistics 4''. 1992.</ref>
It is important to notice that it is not clear, in a general problem, which distribution <math>g(x' | x)</math> one should use or the number of iterations necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem in hand.▼
▲It is important to notice that it is not clear, in a general problem, which distribution <math>g(x'
== Use in numerical integration ==▼
{{main|Monte Carlo integration}}
Line 104 ⟶ 103:
: <math>
P(E) = \int_\Omega P(E
</math>
and, thus, estimating <math>P(E)</math> can be accomplished by estimating the expected value of the [[indicator function]] <math>A_E(x) \equiv \mathbf{1}_E(x)</math>, which is 1 when <math>E(x) \in [E, E + \Delta E]</math> and zero otherwise.
Line 110 ⟶ 109:
==Step-by-step instructions==
[[
Suppose that the most recent value sampled is <math>x_t</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'</math> with probability density <math>g(x'
: <math>a = a_1 a_2,</math>
Line 121:
is the probability (e.g., Bayesian posterior) ratio between the proposed sample <math>x'</math> and the previous sample <math>x_t</math>, and
: <math>a_2 = \frac{g(x_t
is the ratio of the proposal density in two directions (from <math>x_t</math> to <math>x'</math> and conversely).
Line 127:
Then the new state <math>x_{t+1}</math> is chosen according to the following rules.
: If <math>a \geq 1{:}</math>
:: <math>x_{t+1} = x',</math>
: else:
:: <math>x_{t+1} =
\begin{cases}
x' & \text{with probability } a, \\
x_t & \text{with probability } 1
\end{cases}
</math>
The Markov chain is started from an arbitrary initial value <math>x_0</math>, and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represent a [[Sample (statistics)|sample]] from the distribution <math>P(x)</math>.
The algorithm works best if the proposal density matches the shape of the target distribution <math>P(x)</math>, from which direct sampling is difficult, that is <math>g(x'
If a Gaussian proposal density <math>g</math> is used, the variance parameter <math>\sigma^2</math> has to be tuned during the burn-in period.
This is usually done by calculating the ''acceptance rate'', which is the fraction of proposed samples that is accepted in a window of the last <math>N</math> samples.
The desired acceptance rate depends on the target distribution, however it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is about 50%, decreasing to about 23% for an <math>N</math>-dimensional Gaussian target distribution.<ref name=Roberts/> These guidelines can work well when sampling from sufficiently regular Bayesian posteriors as they often follow a multivariate normal distribution as can be established using the [[Bernstein–von Mises theorem]].<ref>{{Cite journal |last1=Schmon |first1=Sebastian M. |last2=Gagnon |first2=Philippe |date=2022-04-15 |title=Optimal scaling of random walk Metropolis algorithms using Bayesian large-sample asymptotics |journal=Statistics and Computing |language=en |volume=32 |issue=2 |pages=28 |doi=10.1007/s11222-022-10080-8 |issn=0960-3174 |pmc=8924149 |pmid=35310543}}</ref>
If <math>\sigma^2</math> is too small, the chain will ''mix slowly'' (i.e., the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to <math>P(x)</math>).
if <math>\sigma^2</math> is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so <math>a_1</math> will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithms accepts on the order of 30% of all samples – in line with the theoretical estimates mentioned in the previous paragraph.
== Bayesian Inference ==
▲[[Image:3dRosenbrock.png|thumb|350px|The result of three [[Markov chain]]s running on the 3D [[Rosenbrock function]] using the Metropolis–Hastings algorithm. The algorithm samples from regions where the [[posterior probability]] is high, and the chains begin to mix in these regions. The approximate position of the maximum has been illuminated. Note that the red points are the ones that remain after the burn-in process. The earlier ones have been discarded.]]
{{main article|Bayesian Inference}}
MCMC can be used to draw samples from the [[posterior distribution]] of a statistical model.
The acceptance probability is given by:
<math>P_{acc}(\theta_i \to \theta^*)=\min\left(1, \frac{\mathcal{L}(y|\theta^*)P(\theta^*)}{\mathcal{L}(y|\theta_i)P(\theta_i)}\frac{Q(\theta_i|\theta^*)}{Q(\theta^*|\theta_i)}\right),</math>
where <math>\mathcal{L}</math> is the [[likelihood]], <math>P(\theta)</math> the prior probability density and <math>Q</math> the (conditional) proposal probability.
==See also==
* [[Genetic algorithm]]s
* [[
* [[Metropolis light transport]]
* [[Multiple-try Metropolis]]
* [[Parallel tempering]]
* [[Particle filter|Sequential Monte Carlo]]
* [[Simulated annealing]]
Line 166:
==References==
{{Reflist|refs=
<ref name="Hastings">{{Cite journal |last=Hastings |first=W.K. |year=1970 |title=Monte Carlo Sampling Methods Using Markov Chains and Their Applications |journal=[[Biometrika]] |volume=57 |issue=1 |pages=97–109 |bibcode=1970Bimka..57...97H |doi=10.1093/biomet/57.1.97 |jstor=2334940 |zbl=0219.65008}}</ref>
<ref name="Teller">Teller, Edward. ''Memoirs: A Twentieth-Century Journey in Science and Politics''. [[Perseus Publishing]], 2001, p. 328</ref>▼
<ref name="Barth">Rosenbluth, Marshall. [https://www.aip.org/history-programs/niels-bohr-library/oral-histories/28636-1 "Oral History Transcript"]. American Institute of Physics</ref>▼
<ref name="Gubernatis">{{
<ref name="Rosenbluth">{{
<!--<ref name="Dyson">{{
<ref name="Roberts">{{Cite journal |last1=Roberts |first1=G.O. |last2=Gelman |first2=A. |last3=Gilks |first3=W.R. |year=1997 |title=Weak convergence and optimal scaling of random walk Metropolis algorithms |url=http://www.stat.columbia.edu/~gelman/research/published/theory7.ps |journal=[[Ann. Appl. Probab.]] |volume=7 |issue=1 |pages=110–120 |citeseerx=10.1.1.717.2582 |doi=10.1214/aoap/1034625254}}</ref>
<ref name="Roberts_Casella">{{
<ref name="Newman_Barkema">{{
▲<ref name=Teller>Teller, Edward. ''Memoirs: A Twentieth-Century Journey in Science and Politics''. [[Perseus Publishing]], 2001, p. 328</ref>
▲<ref name=Barth>Rosenbluth, Marshall. [https://www.aip.org/history-programs/niels-bohr-library/oral-histories/28636-1 "Oral History Transcript"]. American Institute of Physics</ref>
▲<ref name=Gubernatis>{{cite journal |title=Marshall Rosenbluth and the Metropolis Algorithm |author=J.E. Gubernatis |journal=[[Physics of Plasmas]] | volume=12| pages=057303| year=2005| doi=10.1063/1.1887186 | bibcode=2005PhPl...12e7303G |issue=5
▲<ref name=Rosenbluth>{{cite journal |title=Genesis of the Monte Carlo Algorithm for Statistical Mechanics|author=M.N. Rosenbluth |journal=[[AIP Conference Proceedings]] | volume=690 | pages=22–30 | year=2003 | doi=10.1063/1.1632112
▲<ref name=Dyson>{{cite journal |title=Marshall N. Rosenbluth|author=F. Dyson |journal=[[Proceedings of the American Philosophical Society]] | volume=250 | pages=404 | year=2006
▲<ref name="Roberts_Casella">{{cite book |title=Monte Carlo Statistical Methods |url=https://archive.org/details/springer_10.1007-978-1-4757-4145-2 |last1=Robert |first1=Christian |last2=Casella |first2=George |year= 2004 |publisher=Springer |isbn=978-0387212395 }}</ref>
▲<ref name="Newman_Barkema">{{cite book |title=Monte Carlo Methods in Statistical Physics |last1=Newman |first1=M. E. J. |last2=Barkema |first2=G. T. |year= 1999 |publisher=Oxford University Press |___location=USA |isbn=978-0198517979 }}</ref>
}}
==Notes==
{{notelist}}
== Further reading ==
* [[Bernd A. Berg]]. ''Markov Chain Monte Carlo Simulations and Their Statistical Analysis''. Singapore, [[World Scientific]], 2004.
*Chib, Siddhartha;
* [http://www.tandfonline.com/doi/abs/10.1080/03610918.2013.777455#.VOk8J1PF9_c David D. L. Minh and Do Le Minh. "Understanding the Hastings Algorithm." Communications in Statistics - Simulation and Computation, 44:2
* Bolstad, William M. (2010) ''Understanding Computational Bayesian Statistics'', [[John Wiley & Sons]] {{ISBN|0-470-04609-0}}
{{DEFAULTSORT:Metropolis-Hastings Algorithm}}
|