==History==
The algorithm is named for [[Nicholas Metropolis]] and [[W.K. Hastings]]. Metropolis was the first-listed author of a 1953 paper, entitled ''[[Equation of State Calculations by Fast Computing Machines]]'', with coauthors [[Arianna W. Rosenbluth]], [[Marshall Rosenbluth]], [[Augusta H. Teller]] and [[Edward Teller]], and for many years the method was known simply as the "''Metropolis algorithm''".<ref>{{Cite book |last=Kalos |first=Malvin H. |last2=Whitlock |first2=Paula A. |title=Monte Carlo Methods Volume I: Basics |publisher=Wiley |year=1986 |___location=New York |pages=78–88}}</ref><ref>{{Cite journal |last=Tierney |first=Luke |date=1994 |title=Markov chains for exploring posterior distributions |url=https://projecteuclid.org/journals/annals-of-statistics/volume-22/issue-4/Markov-Chains-for-Exploring-Posterior-Distributions/10.1214/aos/1176325750.full |journal=The Annals of Statistics |volume=22 |issue=4 |pages=1701–1762}}</ref> The paper proposed the algorithm for the case of symmetrical proposal distributions, but in 1970, Hastings extended it to the more general case.<ref name=Hastings/> The generalized method was eventually identified by both names, although the first use of the term "Metropolis–Hastings algorithm" is unclear; it may have first appeared in a 1995 review by Chib and Greenberg.<ref>{{Cite journal |last=Chib |first=Siddhartha |last2=Greenberg |first2=Edward |date=1995 |title=Understanding the Metropolis-Hastings Algorithm |url=https://www.jstor.org/stable/2684568 |journal=The American Statistician |volume=49 |issue=4 |pages=327–335 |via=JSTOR}}</ref>
 
Some controversy exists with regard to credit for development of the Metropolis algorithm. Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier article with [[Stanisław Ulam]], and led the group in the Theoretical Division that designed and built the [[MANIAC I]] computer used in the experiments in 1952. However, prior to 2003 there was no detailed account of the algorithm's development. Shortly before his death, [[Marshall Rosenbluth]] attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication. At this conference, Rosenbluth described the algorithm and its development in a presentation titled "Genesis of the Monte Carlo Algorithm for Statistical Mechanics".<ref name=Rosenbluth/> Further historical clarification is made by Gubernatis in a 2005 journal article<ref name=Gubernatis/> recounting the 50th anniversary conference. Rosenbluth makes it clear that he and his then-wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time.
 
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 article worked together for "days (and nights)".<ref name=Teller/> In contrast, the detailed account by Marshall Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of statistical mechanics and take ensemble averages instead of following detailed kinematics". This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach – a topic which he says he had discussed often with [[John von Neumann]]. Arianna Rosenbluth recounted (to Gubernatis in 2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote the code from scratch. In an oral history recorded shortly before his death,<ref name=Barth/> Marshall Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer.
 
==Intuition==
The Metropolis–Hastings algorithm can draw samples from any [[probability distribution]] with [[probability density]] <math>P(x)</math>, provided that a function <math>f(x)</math> proportional to the [[Probability density function|density]] <math>P</math> is known and the values of <math>f(x)</math> can be calculated. The requirement that <math>f(x)</math> need only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because calculating the necessary normalization factor is often extremely difficult in practice.
 
The Metropolis–Hastings algorithm generates a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution. These sample values are produced iteratively, the distribution of the next sample being dependent only on the current sample value, which makes the sequence of samples a [[Markov chain]]. Specifically, at each iteration, the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted, in which case the candidate value is used in the next iteration, or rejected, in which case the candidate value is discarded and the current value is reused in the next iteration. The probability of acceptance is determined by comparing the values of the function <math>f(x)</math> at the current and candidate sample values with respect to the desired distribution.
 
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.
general idea is to generate a sequence of samples which are linked in a [[Markov chain]]; in other words, where each sample in the sequence is conditionally independent of any earlier sample, given the sample immediately before it. The procedure for choosing successive samples guarantees that the distribution of sample values will match the desired distribution after a long time.-->
 
'''Metropolis algorithm (symmetric proposal distribution)''':
 
Let <math>f(x)</math> be a function that is proportional to the desired probability density function <math>P(x)</math> (a.k.a. a target distribution){{efn|In the original paper by Metropolis et al. (1953), <math>f</math> was taken to be the [[Boltzmann distribution]] as the specific application considered was [[Monte Carlo integration]] of [[equation of state|equations of state]] in [[physical chemistry]]; the extension by Hastings generalized to an arbitrary distribution <math>f</math>.}}.
 
# Initialization: Choose an arbitrary point <math>x_t</math> to be the first observation in the sample and choose an arbitrary probability density <math>g(x\mid y)</math> (sometimes written <math>Q(x\mid y)</math>) that suggests a candidate for the next sample value <math>x</math>, given the previous sample value <math>y</math>. In this section, <math>g</math> is assumed to be symmetric; in other words, it must satisfy <math>g(x\mid y) = g(y\mid x)</math>. A usual choice is to let <math>g(x\mid y)</math> be a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited next, making the sequence of samples into a [[random walk]]{{efn|In the original paper by Metropolis et al. (1953), <math>g(x\mid y)</math> was suggested to be a random translation with uniform density over some prescribed range.}}. The function <math>g</math> is referred to as the ''proposal density'' or ''jumping distribution''.
# For each iteration ''t'':
#* ''Generate'' a candidate <math>x'</math> for the next sample by picking from the distribution <math>g(x'\mid x_t)</math>.
#* ''Calculate'' the ''acceptance ratio'' <math>\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate{{efn|In the original paper by Metropolis et al. (1953), <math>f</math> was actually the [[Boltzmann distribution]], as it was applied to physical systems in the context of [[statistical mechanics]] (e.g., a maximal-entropy distribution of microstates for a given temperature at thermal equilibrium). Consequently, the acceptance ratio was itself an exponential of the difference in the parameters of the numerator and denominator of this ratio.}}. Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>.
#* ''Accept or reject'':
#** Generate a uniform random number <math>u \in [0, 1]</math>.
#** If <math>u > \alpha</math>, then ''reject'' the candidate and set <math>x_{t+1} = x_t</math> instead.
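The steps above can be sketched in a short Python program. This is an illustrative sketch, not code from the original paper: the unnormalized standard normal target <math>f(x) = e^{-x^2/2}</math>, the proposal standard deviation, the fixed seed, and the sample count are all arbitrary choices made here for the example.

```python
import math
import random

def metropolis(f, x0, n_samples, step=1.0):
    """Metropolis algorithm with a symmetric Gaussian proposal.

    f: function proportional to the target density P(x)
    x0: initial state; step: proposal standard deviation
    """
    x = x0
    samples = []
    for _ in range(n_samples):
        # Generate a candidate from the symmetric proposal g(x' | x).
        x_new = random.gauss(x, step)
        # Acceptance ratio alpha = f(x')/f(x) = P(x')/P(x);
        # the normalization constant of P cancels in this ratio.
        alpha = f(x_new) / f(x)
        # Accept with probability min(1, alpha); otherwise keep the current x.
        if random.random() <= alpha:
            x = x_new
        samples.append(x)
    return samples

random.seed(42)  # fixed seed so the run is reproducible

# Target known only up to a constant: an unnormalized standard normal.
f = lambda x: math.exp(-x * x / 2)
chain = metropolis(f, x0=0.0, n_samples=50_000)
```

The sample mean and variance of `chain` approach 0 and 1, the moments of the standard normal, even though `f` omits the <math>1/\sqrt{2\pi}</math> normalization factor.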
 
This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. The acceptance ratio <math>\alpha</math> indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is <math>P(x)</math>. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of <math>P(x)</math>, corresponding to an <math>\alpha > 1 \ge u</math>), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the larger the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of <math>P(x)</math>, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works and returns samples that follow the desired distribution with density <math>P(x)</math>.
 
Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal |last=Gilks |first=W. R. |last2=Wild |first2=P. |date=1992-01-01 |title=Adaptive Rejection Sampling for Gibbs Sampling |journal=Journal of the Royal Statistical Society. Series C (Applied Statistics) |volume=41 |issue=2 |pages=337–348 |doi=10.2307/2347565 |jstor=2347565}}</ref> that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:
: <math>P(x'\mid x) = g(x' \mid x) A(x', x).</math>
 
Inserting this relation in the previous equation, we have
 
: <math>\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\frac{g(x \mid x')}{g(x' \mid x)}.</math>
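One common choice of acceptance probability satisfying this ratio condition is <math>A(x', x) = \min\left(1, \frac{P(x')}{P(x)}\frac{g(x \mid x')}{g(x' \mid x)}\right)</math>, since for any <math>r > 0</math>, <math>\min(1, r)/\min(1, 1/r) = r</math>. A quick numerical check in Python confirms this; the unnormalized Laplace target and the shifted (hence asymmetric) Gaussian proposal kernel below are arbitrary illustrative choices, not anything prescribed by the algorithm.

```python
import math

def acceptance(P, g, x_new, x):
    """Metropolis-Hastings choice A(x', x) = min(1, [P(x')/P(x)] * [g(x|x')/g(x'|x)])."""
    r = (P(x_new) / P(x)) * (g(x, x_new) / g(x_new, x))
    return min(1.0, r)

# Arbitrary example densities: unnormalized Laplace target, and an
# asymmetric proposal kernel g(x_to | x_from) shifted by 0.5.
P = lambda x: math.exp(-abs(x))
g = lambda x_to, x_from: math.exp(-((x_to - x_from - 0.5) ** 2))

x, x_new = 0.3, 1.7
lhs = acceptance(P, g, x_new, x) / acceptance(P, g, x, x_new)
rhs = (P(x_new) / P(x)) * (g(x, x_new) / g(x_new, x))
# lhs equals rhs, i.e. A(x', x)/A(x, x') = [P(x')/P(x)][g(x|x')/g(x'|x)] holds.
```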
 
==Step-by-step instructions==
Suppose that the most recent value sampled is <math>x_t</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'</math> with probability density <math>g(x' \mid x_t)</math> and calculate a value
 
: <math>a = a_1 a_2,</math>