Metropolis–Hastings algorithm: Difference between revisions

Revision as of 22:30, 17 October 2019 edit AxelBoldt (talk \| contribs) Administrators 44,651 edits clarify, grammar, link Tag: 2017 wikitext editor ← Previous edit		Latest revision as of 20:20, 22 August 2025 edit undo Citation bot (talk \| contribs) Bots 5,868,178 edits Added article-number. Removed parameters. Some additions/deletions were parameter name changes. \| Use this bot. Report bugs. \| Suggested by Abductive \| Category:Monte Carlo methods \| #UCB_Category 35/65
(96 intermediate revisions by 61 users not shown)
Line 1: {{short description\|Monte Carlo algorithm}} ~~[[Image:Metropolis hastings algorithm.png\|thumb\|450px\|The proposal [[probability distribution\|distribution]] ''Q'' proposes the next point to which the [[random walk]] might move.]]~~ [[File:Flowchart-of-Metropolis-Hastings-M-H-algorithm-for-the-parameter-estimation-using-the.png\|thumb\|300px\|The Metropolis-Hastings algorithm sampling a [[Normal distribution\|normal]] one-dimensional [[Posterior probability\|posterior]] probability distribution.]] In [[statistics]] and [[statistical physics]], the '''Metropolis–Hastings algorithm''' is a [[Markov chain Monte Carlo]] (MCMC) method for obtaining a sequence of [[pseudo-random number sampling\|random samples]] from a [[probability distribution]] from which direct sampling is difficult. ~~This~~New samples are added to the sequence in two steps: first a new sample is proposed based on the previous sample, then the proposed sample is either added to the sequence or rejected depending on the value of the probability distribution at that point. The resulting sequence can be used to approximate the distribution (e.g. to generate a [[histogram]]) or to [[Monte Carlo integration\|compute an integral]] (e.g. an [[expected value]]). Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, there are usually other methods (e.g. [[adaptive rejection sampling]]) that can directly return independent samples from the distribution, and these are free from the problem of [[autocorrelation\|autocorrelated]] samples that is inherent in MCMC methods. ==History== The algorithm ~~was~~is named ~~after~~in part for [[Nicholas Metropolis]], ~~who~~the ~~authored~~first ~~the~~coauthor of a 1953 paper, entitled ''[[Equation of State Calculations by Fast Computing Machines]]'' ~~together~~, with [[Arianna W. 
Rosenbluth]], [[Marshall Rosenbluth]], [[Augusta H. Teller]] and [[Edward Teller]]. For ~~This~~many years the algorithm was known simply as the ''Metropolis algorithm''.<ref>{{Cite book \|last1=Kalos \|first1=Malvin H. \|title=Monte Carlo Methods Volume I: Basics \|last2=Whitlock \|first2=Paula A. \|publisher=Wiley \|year=1986 \|___location=New York \|pages=78–88}}</ref><ref>{{Cite journal \|last=Tierney \|first=Luke \|date=1994 \|title=Markov chains for exploring posterior distributions \|url=https://projecteuclid.org/journals/annals-of-statistics/volume-22/issue-4/Markov-Chains-for-Exploring-Posterior-Distributions/10.1214/aos/1176325750.full \|journal=The Annals of Statistics \|volume=22 \|issue=4 \|pages=1701–1762\|doi=10.1214/aos/1176325750 }}</ref> The paper proposed the algorithm for the case of symmetrical proposal distributions, ~~and~~but in 1970, [[W. K. Hastings]] extended it to the more general case ~~in 1970~~.<ref name=Hastings/> The generalized method was eventually identified by both names, although the first use of the term "Metropolis-Hastings algorithm" is unclear. Some controversy exists with regard to credit for development of the Metropolis algorithm. Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier ~~paper~~article with [[~~Stanislav~~Stanisław Ulam]]~~, was familiar with the computational aspects of the method~~, and led the group in the Theoretical Division that designed and built the [[MANIAC I]] computer used in the experiments in 1952. However, prior to 2003, there was no detailed account of the algorithm's development. ~~Then, shortly~~Shortly before his death, [[Marshall Rosenbluth]] attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication. 
At this conference, Rosenbluth described the algorithm and its development in a presentation titled "Genesis of the Monte Carlo Algorithm for Statistical Mechanics".<ref name=Rosenbluth/> Further historical clarification is made by Gubernatis in a 2005 journal article<ref name=Gubernatis/> recounting the 50th anniversary conference. Rosenbluth makes it clear that he and his wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time. This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 ~~paper~~article worked together for "days (and nights).".<ref name=Teller/> In contrast, the detailed account by Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of [[statistical mechanics]] and take ensemble averages instead of following detailed [[kinematics]]". This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach --– a topic which he says he had discussed often with [[John von Neumann\|John Von Neumann]]. Arianna Rosenbluth recounted (to Gubernatis in 2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote the code from scratch. In an oral history recorded shortly before his death,<ref name=Barth/>, Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer. ~~In terms of reputation there is little reason to question Rosenbluth's account. In a biographical memoir of Rosenbluth, [[Freeman Dyson]] writes~~ ==Description== ~~{{Quote~~ The Metropolis–Hastings algorithm can draw samples from any [[probability distribution]] with [[probability density]] <math>P(x)</math>, provided that we know a function <math>f(x)</math> proportional to the [[Probability density function\|density]] <math>P</math> and the values of <math>f(x)</math> can be calculated. 
The requirement that <math>f(x)</math> must only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because it removes the need to calculate the density's normalization factor, which is often extremely difficult in practice. \|text=Many times I came to Rosenbluth, asking him a question [...] and receiving an answer in two minutes. Then it would usually take me a week of hard work to understand in detail why Rosenbluth's answer was right. He had an amazing ability to see through a complicated physical situation and reach the right answer by physical arguments. Enrico Fermi was the only other physicist I have known who was equal to Rosenbluth in his intuitive grasp of physics.<ref name=Dyson/> }} The Metropolis–Hastings algorithm generates a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution. These sample values are produced iteratively in such a way, that the distribution of the next sample depends only on the current sample value, which makes the sequence of samples a [[Markov chain]]. Specifically, at each iteration, the algorithm proposes a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted, in which case the candidate value is used in the next iteration, or it is rejected in which case the candidate value is discarded, and the current value is reused in the next iteration. The probability of acceptance is determined by comparing the values of the function <math>f(x)</math> of the current and candidate sample values with respect to the desired distribution. 
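As a small illustration of why a function that is only proportional to the density suffices, the sketch below compares the ratio of two unnormalized values with the ratio of the corresponding normalized densities. The standard normal target and the names `f`, `normal_pdf` and `ratio` are assumptions chosen for this example; they do not come from the article.

```python
import math


def f(x):
    """Unnormalized standard normal density (assumed example target)."""
    return math.exp(-0.5 * x * x)


def normal_pdf(x):
    """The same density with its normalization factor 1/sqrt(2*pi) included."""
    return f(x) / math.sqrt(2.0 * math.pi)


def ratio(density, x_new, x_old):
    """Ratio of density values, as used when comparing candidate and current sample."""
    return density(x_new) / density(x_old)


x_old, x_new = 0.3, 1.2
r_unnormalized = ratio(f, x_new, x_old)
r_normalized = ratio(normal_pdf, x_new, x_old)
# The normalization factor cancels, so both ratios agree up to rounding.
```

Because only such ratios enter the acceptance decision, the normalization factor never has to be computed.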
~~==Intuition==~~ The Metropolis–Hastings algorithm can draw samples from any [[probability distribution]] <math>P(x)</math>, provided that we know a function <math>f(x)</math> proportional to the [[Probability density function\|density]] of <math>P</math> and the values of <math>f(x)</math> can be calculated. The requirement that <math>f(x)</math> must only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because calculating the necessary normalization factor is often extremely difficult in practice. The method used to propose new candidates is characterized by the probability distribution <math>g(x\mid y)</math> (sometimes written <math>Q(x\mid y)</math>) of a new proposed sample <math>x</math> given the previous sample <math>y</math>. This is called the ''proposal density'', ''proposal function'', or ''jumping distribution''. A common choice for <math>g(x\mid y)</math> is a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited next, making the sequence of samples into a [[Gaussian random walk]]. In the original paper by Metropolis et al. (1953), <math>g(x\mid y)</math> was suggested to be a uniform distribution limited to some maximum distance from <math>y</math>. More complicated proposal functions are also possible, such as those of [[Hamiltonian Monte Carlo]], [[Langevin Monte Carlo]], or [[preconditioned Crank–Nicolson]]. The Metropolis–Hastings algorithm works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution <math>P(x)</math>. These sample values are produced iteratively, with the distribution of the next sample being dependent only on the current sample value (thus making the sequence of samples into a [[Markov chain]]). 
Specifically, at each iteration, the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted (in which case the candidate value is used in the next iteration) or rejected (in which case the candidate value is discarded, and current value is reused in the next iteration)—the probability of acceptance is determined by comparing the values of the function <math>f(x)</math> of the current and candidate sample values with respect to the desired distribution <math>P(x)</math>. For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below. <!---The sample values are linked in a [[Markov chain]], which means that the probability of each sample is conditionally independent of any earlier sample, given the sample immediately before it. In other words, general idea is to generate a sequence of samples which are linked in a [[Markov chain]]; in other words, where each sample in the sequence is conditionally independent of any earlier sample, given the sample immediately before it. The procedure for choosing successive samples guarantees that the distribution of sample values will match the desired distribution ~~''P''(''x'')~~ after a long time.!--> ~~'''~~;Metropolis algorithm (symmetric proposal distribution)~~'''~~: Let <math>f(x)</math> be a function that is proportional to the desired probability density function <math>P(x)</math> (a.k.a. a target distribution){{efn\|In the original paper by Metropolis et al. (1953), <math>f</math> was taken to be the [[Boltzmann distribution]] as the specific application considered was [[Monte Carlo integration]] of [[equation of state\|equations of state]] in [[physical chemistry]]; the extension by Hastings generalized to an arbitrary distribution <math>f</math>.}}. 
# Initialization: Choose an arbitrary point <math>x_0</math> to be the first observation in the sample and choose a proposal function <math>g(x\mid y)</math>. In this section, <math>g</math> is assumed to be symmetric; in other words, it must satisfy <math>g(x\mid y) = g(y\mid x)</math>.
# For each iteration ''t'':
#* ''Propose'' a candidate <math>x'</math> for the next sample by picking from the distribution <math>g(x'\mid x_t)</math>.
#* ''Calculate'' the ''acceptance ratio'' <math>\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate.{{efn\|In the original paper by Metropolis et al. (1953), <math>f</math> was actually the [[Boltzmann distribution]], as it was applied to physical systems in the context of [[statistical mechanics]] (e.g., a maximal-entropy distribution of microstates for a given temperature at thermal equilibrium). Consequently, the acceptance ratio was itself an exponential of the difference in the parameters of the numerator and denominator of this ratio.}} Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>.
#* ''Accept or reject'':
#** Generate a uniform random number <math>u \in [0, 1]</math>.
#** If <math>u \le \alpha</math>, then ''accept'' the candidate by setting <math>x_{t+1} = x'</math>.
#** If <math>u > \alpha</math>, then ''reject'' the candidate and set <math>x_{t+1} = x_t</math> instead.

This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. In the long run, the fraction of iterations that the algorithm spends near a specific point <math>x</math> is proportional to <math>P(x)</math>. Note that the acceptance ratio <math>\alpha</math> indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is <math>P(x)</math>.
If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of <math>P(x)</math> corresponding to an <math>\alpha > 1 \ge u</math>), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the larger the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of <math>P(x)</math>, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works and returns samples that follow the desired distribution with density <math>P(x)</math>. ~~Let <math>f(x)</math> be a function that is proportional to the desired probability distribution <math>P(x)</math> (a.k.a. a target distribution).~~ Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal \|last1=Gilks \|first1=W. R. \|last2=Wild \|first2=P. \|date=1992-01-01 \|title=Adaptive Rejection Sampling for Gibbs Sampling \|journal=Journal of the Royal Statistical Society. Series C (Applied Statistics) \|volume=41 \|issue=2 \|pages=337–348 \|doi=10.2307/2347565 \|jstor=2347565}}</ref> that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages: # Initialization: Choose an arbitrary point <math>x_0</math> to be the first sample, and choose an arbitrary probability density <math>g(x\|y)</math> (sometimes written <math>Q(x\|y)</math>) that suggests a candidate for the next sample value <math>x</math>, given the previous sample value <math>y</math>. For the Metropolis algorithm, <math>g</math> must be symmetric; in other words, it must satisfy <math>g(x\|y) = g(y\|x)</math>. 
A usual choice is to let <math>g(x\|y)</math> be a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited next—making the sequence of samples into a [[random walk]]. The function <math>g</math> is referred to as the ''proposal density'' or ''jumping distribution''. * The samples are [[autocorrelation\|autocorrelated]]. Even though over the long term they do correctly follow <math>P(x)</math>, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that effective sample sizes can be significantly lower than the number of samples actually taken, leading to large errors. ~~# For each iteration ''t'':~~ * Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a ''burn-in'' period is typically necessary,<ref>{{Cite book \|title=Bayesian data analysis \|date=2004 \|publisher=Chapman & Hall / CRC \|others=Gelman, Andrew \|isbn=978-1584883883 \|edition=2nd \|___location=Boca Raton, Fla. \|oclc=51991499}}</ref> where an initial number of samples are thrown away. ~~#* '''Generate''' : Generate a candidate <math>x'</math> for the next sample by picking from the distribution <math>g(x'\|x_t)</math>.~~ #* '''Calculate''' : Calculate the ''acceptance ratio'' <math display="inline">\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate. Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>. 
~~#* '''Accept or Reject''' :~~ ~~# Generate a uniform random number <math>u</math> on [0,1].~~ ~~# If <math>u \le \alpha</math> ''accept'' the candidate by setting <math>x_{t+1} = x'</math>,~~ ~~#** If <math>u > \alpha</math> ''reject'' the candidate and set <math>x_{t+1}=x_t</math>, instead.~~ On the other hand, most simple [[rejection sampling]] methods suffer from the "[[curse of dimensionality]]", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from [[hierarchical Bayesian model]]s and other high-dimensional statistical models used nowadays in many disciplines. This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio <math>\alpha</math> indicates how probable the new proposed sample is with respect to the current sample, according to the distribution <math>\displaystyle P(x)</math>. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of <math>\displaystyle P(x)</math>), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the more the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of <math>\displaystyle P(x)</math>, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works, and returns samples that follow the desired distribution <math>\displaystyle P(x)</math>. 
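The Metropolis algorithm described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `metropolis`, the Gaussian random-walk proposal with width `step`, the fixed `seed`, and the unnormalized standard normal target are all choices made for this example.

```python
import math
import random


def metropolis(f, x0, n_samples, step=1.0, seed=0):
    """Metropolis sampler with a symmetric Gaussian proposal (illustrative).

    f  : function proportional to the target density
    x0 : starting point; assumed to satisfy f(x0) > 0
    """
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    samples = []
    for _ in range(n_samples):
        # Propose a candidate from g(x' | x_t), a Gaussian centered at x_t.
        x_new = x + rng.gauss(0.0, step)
        f_new = f(x_new)
        # Acceptance ratio alpha = f(x') / f(x_t).
        alpha = f_new / fx
        # Accept if u <= alpha; otherwise the current value is reused.
        if rng.random() <= alpha:
            x, fx = x_new, f_new
        samples.append(x)
    return samples


def unnormalized_normal(x):
    """Assumed example target: proportional to a standard normal density."""
    return math.exp(-0.5 * x * x)


samples = metropolis(unnormalized_normal, x0=0.0, n_samples=20000)
kept = samples[1000:]  # drop an initial stretch as a simple burn-in
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
```

For this target, `mean` and `variance` should come out close to 0 and 1, although the exact values depend on the seed and on the chosen step width.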
In [[multivariate distribution\|multivariate]] distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the suitable jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as [[Gibbs sampling]], involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. That way, the problem of sampling from potentially high-dimensional space will be reduced to a collection of problems to sample from small dimensionality.<ref>{{Cite journal \|last=Lee \|first=Se Yoon \|year=2021 \|title=Gibbs sampler and coordinate ascent variational inference: A set-theoretical review \|journal=Communications in Statistics - Theory and Methods \|volume=51 \|issue=6 \|pages=1549–1568 \|arxiv=2008.01006 \|doi=10.1080/03610926.2021.1921214 \|s2cid=220935477}}</ref> This is especially applicable when the multivariate distribution is composed of a set of individual [[random variable]]s in which each variable is conditioned on only a small number of other variables, as is the case in most typical [[hierarchical Bayesian model\|hierarchical models]]. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the [[adaptive rejection sampling]] methods,<ref name=":0" /> the adaptive rejection Metropolis sampling algorithm,<ref>{{Cite journal \|last1=Gilks \|first1=W. R. \|last2=Best \|first2=N. G. 
\|author-link2=Nicky Best \|last3=Tan \|first3=K. K. C. \|date=1995-01-01 \|title=Adaptive Rejection Metropolis Sampling within Gibbs Sampling \|journal=Journal of the Royal Statistical Society. Series C (Applied Statistics) \|volume=44 \|issue=4 \|pages=455–472 \|doi=10.2307/2986138 \|jstor=2986138}}</ref> a simple one-dimensional Metropolis–Hastings step, or [[slice sampling]]. Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal\|title = Adaptive Rejection Sampling for Gibbs Sampling\|jstor = 2347565\|journal = Journal of the Royal Statistical Society. Series C (Applied Statistics)\|date = 1992-01-01\|pages = 337–348\|volume = 41\|issue = 2\|doi = 10.2307/2347565\|first = W. R.\|last = Gilks\|first2 = P.\|last2 = Wild}}</ref> that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages: The samples are correlated. Even though over the long term they do correctly follow <math>\displaystyle P(x)</math>, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every ''n''th sample, for some value of ''n'' (typically determined by examining the [[autocorrelation]] between adjacent samples). Autocorrelation can be reduced by increasing the ''jumping width'' (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a ''slow-mixing'' Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution. 
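The effect of the jumping width and of thinning on autocorrelation can be checked numerically. The sketch below is an assumption-laden illustration: `autocorrelation` is a simple sample estimator written for this example, `metropolis_chain` is a hypothetical helper that targets an unnormalized standard normal density, and the step widths 0.1 and 2.5 are arbitrary.

```python
import math
import random


def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def metropolis_chain(step, n, seed=0):
    """Metropolis chain for f(x) = exp(-x^2 / 2) with a Gaussian proposal."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n):
        x_new = x + rng.gauss(0.0, step)
        # Compare log(u) with log(alpha) = (x^2 - x_new^2) / 2;
        # 1 - random() lies in (0, 1], so the logarithm is defined.
        if math.log(1.0 - rng.random()) <= 0.5 * (x * x - x_new * x_new):
            x = x_new
        chain.append(x)
    return chain


chain = metropolis_chain(step=2.5, n=20000)
rho_small_step = autocorrelation(metropolis_chain(step=0.1, n=20000), lag=1)
rho_moderate_step = autocorrelation(chain, lag=1)
rho_thinned = autocorrelation(chain[::10], lag=1)  # keep every 10th sample
```

A very small jumping width gives neighbouring samples that are almost identical, so `rho_small_step` is close to 1, while keeping only every 10th sample of the wider-step chain lowers the correlation between retained neighbours.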
Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a ''burn-in'' period is typically necessary,<ref>{{Cite book\|title=Bayesian data analysis\|date=2004\|publisher=Chapman & Hall/CRC\|others=Gelman, Andrew.\|isbn=978-1584883883\|edition= 2nd\|___location=Boca Raton, Fla.\|oclc=51991499}}</ref> where an initial number of samples (e.g. the first 1,000 or so) are thrown away. On the other hand, most simple [[rejection sampling]] methods suffer from the "[[curse of dimensionality]]", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from [[hierarchical Bayesian model]]s and other high-dimensional statistical models used nowadays in many disciplines. In [[multivariate distribution\|multivariate]] distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the right jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as [[Gibbs sampling]], involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. 
This is especially applicable when the multivariate distribution is composed of a set of individual [[random variable]]s in which each variable is conditioned on only a small number of other variables, as is the case in most typical [[hierarchical Bayesian model\|hierarchical model]]s. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the [[adaptive rejection sampling]] methods,<ref name=":0" /><ref>{{Cite journal\|title = Concave-Convex Adaptive Rejection Sampling\|journal = Journal of Computational and Graphical Statistics\|date = 2011-01-01\|issn = 1061-8600\|pages = 670–691\|volume = 20\|issue = 3\|doi = 10.1198/jcgs.2011.09058\|first = Dilan\|last = Görür\|first2 = Yee Whye\|last2 = Teh}}</ref><ref>{{Cite journal\|title = A Rejection Technique for Sampling from T-concave Distributions\|journal = ACM Trans. Math. Softw.\|date = 1995-06-01\|issn = 0098-3500\|pages = 182–193\|volume = 21\|issue = 2\|doi = 10.1145/203082.203089\|first = Wolfgang\|last = Hörmann\|citeseerx = 10.1.1.56.6055}}</ref><ref>{{Cite journal\|title = A generalization of the adaptive rejection sampling algorithm\|journal = Statistics and Computing\|date = 2010-08-25\|issn = 0960-3174\|pages = 633–647\|volume = 21\|issue = 4\|doi = 10.1007/s11222-010-9197-9\|first = Luca\|last = Martino\|first2 = Joaquín\|last2 = Míguez\|hdl = 10016/16624}}</ref> the adaptive rejection Metropolis sampling algorithm<ref>{{Cite journal\|title = Adaptive Rejection Metropolis Sampling within Gibbs Sampling\|jstor = 2986138\|journal = Journal of the Royal Statistical Society. Series C (Applied Statistics)\|date = 1995-01-01\|pages = 455–472\|volume = 44\|issue = 4\|doi = 10.2307/2986138\|first = W. R.\|last = Gilks\|first2 = N. G.\|last2 = Best\|author2-link= Nicky Best \|first3 = K. 
K. C.\|last3 = Tan}}</ref> or its improvements<ref>{{Cite journal\|title = Independent Doubly Adaptive Rejection Metropolis Sampling Within Gibbs Sampling\|journal = IEEE Transactions on Signal Processing\|date = 2015-06-01\|issn = 1053-587X\|pages = 3123–3138\|volume = 63\|issue = 12\|doi = 10.1109/TSP.2015.2420537\|first = L.\|last = Martino\|first2 = J.\|last2 = Read\|first3 = D.\|last3 = Luengo\|arxiv = 1205.5494\|bibcode = 2015ITSP...63.3123M}}</ref><ref>{{Cite journal\|title = Adaptive rejection Metropolis sampling using Lagrange interpolation polynomials of degree 2\|journal = Computational Statistics & Data Analysis\|date = 2008-03-15\|pages = 3408–3423\|volume = 52\|issue = 7\|doi = 10.1016/j.csda.2008.01.005\|first = Renate\|last = Meyer\|first2 = Bo\|last2 = Cai\|first3 = François\|last3 = Perron}}</ref> (see [http://a2rms.sourceforge.net matlab code]), a simple one-dimensional Metropolis–Hastings step, or [[slice sampling]]. ==Formal derivation== The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution <math>P(x)</math>. To accomplish this, the algorithm uses a [[Markov process]], which asymptotically reaches a unique [[Markov chain#Steady-state analysis and limiting distributions\|stationary distribution]] <math>\pi(x)</math> such that <math>\pi(x) = P(x)</math> .<ref name=Roberts_Casella/> A Markov process is uniquely defined by its transition probabilities, <math>P(x' \|\mid x)</math>, the probability of transitioning from any given state, <math>x</math>, to any other given state, <math>x'</math>. It has a unique stationary distribution <math>\pi(x)</math> when the following two conditions are met:<ref name=Roberts_Casella/> # ''~~'existence~~Existence of stationary distribution''': there must exist a stationary distribution <math>\pi(x)</math>. 
A sufficient but not necessary condition is [[detailed balance]], which requires that each transition <math>x \to x'</math> is reversible: for every pair of states <math>x, x'</math>, the probability of being in state <math>x</math> and transitioning to state <math>x'</math> must be equal to the probability of being in state <math>x'</math> and transitioning to state <math>x</math>, <math>\pi(x) P(x' \mid x) = \pi(x') P(x \mid x')</math>.
# ''Uniqueness of stationary distribution'': the stationary distribution <math>\pi(x)</math> must be unique. This is guaranteed by [[Markov Chain#Ergodicity\|ergodicity]] of the Markov process, which requires that every state must (1) be aperiodic, meaning that the system does not return to the same state at fixed intervals; and (2) be positive recurrent, meaning that the expected number of steps for returning to the same state is finite.

The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition probabilities) that fulfills the two above conditions, such that its stationary distribution <math>\pi(x)</math> is chosen to be <math>P(x)</math>. The derivation of the algorithm starts with the condition of [[detailed balance]]:
: <math>P(x' \mid x) P(x) = P(x \mid x') P(x'),</math>
which is re-written as
: <math>\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.</math>

The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The proposal distribution <math>g(x' \mid x)</math> is the conditional probability of proposing a state <math>x'</math> given <math>x</math>, and the acceptance distribution <math>A(x', x)</math> is the probability to accept the proposed state <math>x'</math>.
The transition probability can be written as the product of them:
: <math>P(x' \mid x) = g(x' \mid x) A(x', x).</math>
Inserting this relation in the previous equation, we have
: <math>\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\frac{g(x \mid x')}{g(x' \mid x)}.</math>
The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:
: <math>A(x', x) = \min\left(1, \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)}\right).</math>
For this Metropolis acceptance ratio <math>A</math>, either <math>A(x', x) = 1</math> or <math>A(x, x') = 1</math> and, either way, the condition is satisfied.

The Metropolis–Hastings algorithm can thus be written as follows:
# Initialise
## Pick an initial state <math>x_0</math>.
## Set <math>t = 0</math>.
# Iterate
## '''Generate''' a random candidate state <math>x'</math> according to <math>g(x' \mid x_t)</math>.
## '''Calculate''' the acceptance probability <math>A(x', x_t) = \min\left(1, \frac{P(x')}{P(x_t)} \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)</math>.
## '''Accept or reject''':
### generate a uniform random number <math>u \in [0, 1]</math>;
### if <math>u \le A(x', x_t)</math>, then ''accept'' the new state and set <math>x_{t+1} = x'</math>;
### if <math>u > A(x', x_t)</math>, then ''reject'' the new state, and copy the old state forward <math>x_{t+1} = x_{t}</math>.
## '''Increment''': set <math>t = t + 1</math>.

Provided that specified conditions are met, the empirical distribution of saved states <math>x_0, \ldots, x_T</math> will approach <math>P(x)</math>.
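The steps listed above can be sketched in Python for a proposal that is not symmetric, so that the correction factor <math>g(x_t \mid x')/g(x' \mid x_t)</math> actually matters. Everything specific here is an assumption made for illustration: the function name `metropolis_hastings`, the unnormalized Gamma(2, 1) target `f`, the multiplicative log-normal proposal with width `SIGMA`, and the seed.

```python
import math
import random


def metropolis_hastings(f, propose, g, x0, n_steps, rng):
    """Generic Metropolis-Hastings loop (illustrative sketch).

    f       : function proportional to the target density P
    propose : draws x' from g(. | x)
    g       : proposal density, called as g(x_new, x_old)
    """
    x = x0
    states = [x]
    for _ in range(n_steps):
        x_new = propose(x, rng)
        # A(x', x_t) = min(1, P(x')/P(x_t) * g(x_t | x') / g(x' | x_t))
        ratio = (f(x_new) / f(x)) * (g(x, x_new) / g(x_new, x))
        accept_prob = min(1.0, ratio)
        if rng.random() <= accept_prob:
            x = x_new          # accept the candidate
        states.append(x)       # on rejection the old state is copied forward
    return states


SIGMA = 0.8  # assumed proposal width on the log scale


def f(x):
    """Unnormalized Gamma(shape=2, rate=1) density on x > 0."""
    return x * math.exp(-x) if x > 0 else 0.0


def propose(x, rng):
    """Multiplicative random walk: x' = x * exp(SIGMA * z), z ~ N(0, 1)."""
    return x * math.exp(SIGMA * rng.gauss(0.0, 1.0))


def g(x_new, x_old):
    """Log-normal density of x_new given x_old (asymmetric in its arguments)."""
    z = (math.log(x_new) - math.log(x_old)) / SIGMA
    return math.exp(-0.5 * z * z) / (x_new * SIGMA * math.sqrt(2.0 * math.pi))


states = metropolis_hastings(f, propose, g, x0=1.0, n_steps=50000,
                             rng=random.Random(1))
kept = states[1000:]  # discard an initial stretch as burn-in
mean = sum(kept) / len(kept)
```

The Gamma(2, 1) distribution has mean 2, so `mean` should land near that value. Dropping the `g` ratio from `ratio` would bias the chain, because this proposal favours some directions over others.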
The number of iterations (<math>T</math>) required to effectively estimate <math>P(x)</math> depends on a number of factors, including the relationship between <math>P(x)</math> and the proposal distribution and the desired accuracy of estimation.<ref>Raftery, Adrian E., and Steven Lewis. "How Many Iterations in the Gibbs Sampler?" In ''Bayesian Statistics 4''. 1992.</ref> For distributions on discrete state spaces, it has to be of the order of the [[autocorrelation]] time of the Markov process.<ref name=Newman_Barkema/>

It is important to notice that it is not clear, in a general problem, which distribution <math>g(x' \mid x)</math> one should use or the number of iterations necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem in hand.

==Use in numerical integration==
{{main\|Monte Carlo integration}}
A common use of the Metropolis–Hastings algorithm is to compute an integral. Specifically, consider a space <math>\Omega \subset \mathbb{R}</math> and a probability distribution <math>P(x)</math> over <math>\Omega</math>, <math>x \in \Omega</math>. Metropolis–Hastings can estimate an integral of the form
: <math>P(E) = \int_\Omega A(x) P(x) \,dx,</math>
where <math>A(x)</math> is a (measurable) function of interest.

For example, consider a [[statistic]] <math>E(x)</math> and its probability distribution <math>P(E)</math>, which is a [[marginal distribution]]. Suppose that the goal is to estimate <math>P(E)</math> for <math>E</math> on the tail of <math>P(E)</math>.
Formally, <math>P(E)</math> can be written as
: <math>P(E) = \int_\Omega P(E \mid x) P(x) \,dx = \int_\Omega \delta\big(E - E(x)\big) P(x) \,dx = E \big(P(E \mid X)\big)</math>
and, thus, estimating <math>P(E)</math> can be accomplished by estimating the expected value of the [[indicator function]] <math>A_E(x) \equiv \mathbf{1}_E(x)</math>, which is 1 when <math>E(x) \in [E, E + \Delta E]</math> and zero otherwise. Because <math>E</math> is on the tail of <math>P(E)</math>, the probability to draw a state <math>x</math> with <math>E(x)</math> on the tail of <math>P(E)</math> is proportional to <math>P(E)</math>, which is small by definition. The Metropolis–Hastings algorithm can be used here to sample (rare) states more frequently and thus increase the number of samples used to estimate <math>P(E)</math> on the tails. This can be done e.g. by using a sampling distribution <math>\pi(x)</math> to favor those states (e.g. <math>\pi(x) \propto e^{a E}</math> with <math>a > 0</math>).

==Step-by-step instructions==
[[File:3dRosenbrock.png\|thumb\|300px\|Three [[Markov chain]]s running on the 3D [[Rosenbrock function]] using the Metropolis–Hastings algorithm. The chains converge and mix in the region where the function is high. The approximate position of the maximum has been illuminated. The red points are the ones that remain after the burn-in process. The earlier ones have been discarded.]]
Suppose that the most recent value sampled is <math>x_t\,</math>.
To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'\,</math> with probability density <math>g(x' \mid x_t)\,</math>, and calculate a value
: <math>a = a_1 a_2,</math>
where
: <math>a_1 = \frac{P(x')}{P(x_t)}</math>
is the probability (e.g., Bayesian posterior) ratio between the proposed sample <math>x'\,</math> and the previous sample <math>x_t\,</math>, and
: <math>a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}</math>
is the ratio of the proposal density in two directions (from <math>x_t\,</math> to <math>x'\,</math> and conversely). This is equal to 1 if the proposal density is symmetric. Then the new state <math>x_{t+1}</math> is chosen according to the following rules.
: If <math>a \geq 1{:}</math>
:: <math>x_{t+1} = x',</math>
: else:
:: <math>x_{t+1} = \begin{cases} x' & \text{with probability } a, \\ x_t & \text{with probability } 1-a. \end{cases}</math>

The Markov chain is started from an arbitrary initial value <math>x_0</math>, and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represents a [[Sample (statistics)\|sample]] from the distribution <math>P(x)</math>.

The algorithm works best if the proposal density matches the shape of the target distribution <math>P(x)</math>, from which direct sampling is difficult, that is <math>g(x' \mid x_t) \approx P(x')</math>. If a Gaussian proposal density <math>g</math> is used, the variance parameter <math>\sigma^2</math> has to be tuned during the burn-in period. This is usually done by calculating the ''acceptance rate'', which is the fraction of proposed samples that is accepted in a window of the last <math>N</math> samples. The desired acceptance rate depends on the target distribution; however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is about 50%, decreasing to about 23% for an <math>N</math>-dimensional Gaussian target distribution.<ref name=Roberts/> These guidelines can work well when sampling from sufficiently regular Bayesian posteriors as they often follow a multivariate normal distribution as can be established using the [[Bernstein–von Mises theorem]].<ref>{{Cite journal \|last1=Schmon \|first1=Sebastian M. \|last2=Gagnon \|first2=Philippe \|date=2022-04-15 \|title=Optimal scaling of random walk Metropolis algorithms using Bayesian large-sample asymptotics \|journal=Statistics and Computing \|language=en \|volume=32 \|issue=2 \|pages=28 \|doi=10.1007/s11222-022-10080-8 \|issn=0960-3174 \|pmc=8924149 \|pmid=35310543}}</ref>

If <math>\sigma^2</math> is too small, the chain will ''mix slowly'' (i.e., the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to <math>P(x)</math>).
On the other hand, if <math>\sigma^2</math> is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so <math>a_1</math> will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithm accepts on the order of 30% of all samples, in line with the theoretical estimates mentioned in the previous paragraph.

== Bayesian Inference ==
{{main article\|Bayesian Inference}}
MCMC can be used to draw samples from the [[posterior distribution]] of a statistical model. The acceptance probability is given by: <math>P_{acc}(\theta_i \to \theta^*)=\min\left(1, \frac{\mathcal{L}(y \mid \theta^*)P(\theta^*)}{\mathcal{L}(y \mid \theta_i)P(\theta_i)}\frac{Q(\theta_i \mid \theta^*)}{Q(\theta^* \mid \theta_i)}\right),</math> where <math>\mathcal{L}</math> is the [[likelihood]], <math>P(\theta)</math> the prior probability density and <math>Q</math> the (conditional) proposal probability.
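As an illustration of this acceptance probability, the following Python sketch samples the posterior of a single parameter. Everything specific in it is an assumption made for the example: the model (normally distributed data with unknown mean <math>\theta</math> and unit variance), the normal prior with standard deviation 10, the random-walk proposal with standard deviation 0.5, and all function names. Because the proposal is symmetric, the <math>Q</math> ratio equals 1 and is omitted. For this conjugate model the posterior mean is known in closed form, which gives a simple check on the sampler.

```python
import math
import random


def log_likelihood(theta, y):
    # Log of L(y | theta) for unit-variance normal data, up to a constant.
    return -0.5 * sum((yi - theta) ** 2 for yi in y)


def log_prior(theta, prior_sd=10.0):
    # Log of the prior density P(theta), a zero-mean normal, up to a constant.
    return -0.5 * (theta / prior_sd) ** 2


def log_acceptance(theta_star, theta_i, y):
    # Log of P_acc for a symmetric proposal, so the Q ratio is 1.
    log_post_star = log_likelihood(theta_star, y) + log_prior(theta_star)
    log_post_i = log_likelihood(theta_i, y) + log_prior(theta_i)
    return min(0.0, log_post_star - log_post_i)


rng = random.Random(1)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]  # synthetic data, true mean 2.0

theta = 0.0
draws = []
accepted = 0
n_iter = 20000
for _ in range(n_iter):
    theta_star = theta + rng.gauss(0.0, 0.5)  # symmetric random-walk proposal
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) <= log_acceptance(theta_star, theta, y):
        theta = theta_star
        accepted += 1
    draws.append(theta)

burn_in = 2000
kept = draws[burn_in:]
posterior_mean = sum(kept) / len(kept)
acceptance_rate = accepted / n_iter

# Closed-form posterior mean for this conjugate normal-normal model.
exact_mean = sum(y) / (len(y) + 1.0 / 10.0 ** 2)
print(posterior_mean, exact_mean, acceptance_rate)
```

Only ratios of the posterior enter the acceptance probability, so the normalising constant of the posterior is never needed; this is why the constants dropped in <code>log_likelihood</code> and <code>log_prior</code> do not affect the result.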
==See also==
* [[Detailed balance]]
* [[Genetic algorithm]]s
* [[Mean-field particle methods]]
* [[Metropolis-adjusted Langevin algorithm]]
* [[Metropolis light transport]]
* [[Multiple-try Metropolis]]
* [[Parallel tempering]]
* [[Preconditioned Crank–Nicolson algorithm]]
* [[Particle filter\|Sequential Monte Carlo]]
* [[Simulated annealing]]

==References==
{{Reflist\|refs=
<ref name="Hastings">{{Cite journal \|last=Hastings \|first=W.K. \|year=1970 \|title=Monte Carlo Sampling Methods Using Markov Chains and Their Applications \|journal=[[Biometrika]] \|volume=57 \|issue=1 \|pages=97–109 \|bibcode=1970Bimka..57...97H \|doi=10.1093/biomet/57.1.97 \|jstor=2334940 \|zbl=0219.65008}}</ref>
<ref name="Teller">Teller, Edward. ''Memoirs: A Twentieth-Century Journey in Science and Politics''. [[Perseus Publishing]], 2001, p. 328</ref>
<ref name="Barth">Rosenbluth, Marshall. [https://www.aip.org/history-programs/niels-bohr-library/oral-histories/28636-1 "Oral History Transcript"]. American Institute of Physics</ref>
<ref name="Gubernatis">{{Cite journal \|last=J.E. Gubernatis \|year=2005 \|title=Marshall Rosenbluth and the Metropolis Algorithm \|url=https://zenodo.org/record/1231899 \|journal=[[Physics of Plasmas]] \|volume=12 \|issue=5 \|article-number=057303 \|bibcode=2005PhPl...12e7303G \|doi=10.1063/1.1887186}}</ref>
<ref name="Rosenbluth">{{Cite journal \|last=M.N. Rosenbluth \|year=2003 \|title=Genesis of the Monte Carlo Algorithm for Statistical Mechanics \|journal=[[AIP Conference Proceedings]] \|volume=690 \|pages=22–30 \|bibcode=2003AIPC..690...22R \|doi=10.1063/1.1632112}}</ref>
<!--<ref name="Dyson">{{Cite journal \|last=F. Dyson \|year=2006 \|title=Marshall N. Rosenbluth \|journal=[[Proceedings of the American Philosophical Society]] \|volume=250 \|pages=404}}</ref>-->
<ref name="Roberts">{{Cite journal \|last1=Roberts \|first1=G.O. \|last2=Gelman \|first2=A. \|last3=Gilks \|first3=W.R. \|year=1997 \|title=Weak convergence and optimal scaling of random walk Metropolis algorithms \|url=http://www.stat.columbia.edu/~gelman/research/published/theory7.ps \|journal=[[Ann. Appl. Probab.]] \|volume=7 \|issue=1 \|pages=110–120 \|citeseerx=10.1.1.717.2582 \|doi=10.1214/aoap/1034625254}}</ref>
<ref name="Roberts_Casella">{{Cite book \|last1=Robert \|first1=Christian \|url=https://archive.org/details/springer_10.1007-978-1-4757-4145-2 \|title=Monte Carlo Statistical Methods \|last2=Casella \|first2=George \|publisher=Springer \|year=2004 \|isbn=978-0387212395}}</ref>
<ref name="Newman_Barkema">{{Cite book \|last1=Newman \|first1=M. E. J. \|title=Monte Carlo Methods in Statistical Physics \|last2=Barkema \|first2=G. T. \|publisher=Oxford University Press \|year=1999 \|isbn=978-0198517979 \|___location=USA}}</ref>
}}

==Notes==
{{notelist}}

== Further reading ==
* [[Bernd A. Berg]]. ''Markov Chain Monte Carlo Simulations and Their Statistical Analysis''. Singapore, [[World Scientific]], 2004.
* Chib, Siddhartha; Greenberg, Edward (1995). [https://www.jstor.org/stable/2684568 "Understanding the Metropolis–Hastings Algorithm"]. ''[[The American Statistician]]'', 49(4), 327–335.
* [http://www.tandfonline.com/doi/abs/10.1080/03610918.2013.777455#.VOk8J1PF9_c David D. L. Minh and Do Le Minh. "Understanding the Hastings Algorithm." Communications in Statistics - Simulation and Computation, 44:2 332–349, 2015]
* Bolstad, William M. (2010) ''Understanding Computational Bayesian Statistics'', [[John Wiley & Sons]] {{ISBN\|0-470-04609-0}}
* [http://xbeta.org/wiki/show/Metropolis-Hastings+algorithm Metropolis-Hastings algorithm on xβ]
* [https://web.archive.org/web/20110405024000/http://www.quantiphile.com/2010/11/01/metropolis-hastings/ Matlab implementation of Random-Walk Metropolis]
* [http://blog.abhranil.net/2014/02/08/r-code-for-multivariate-random-walk-metropolis-hastings-sampling/ R implementation of Random-Walk Metropolis]
* [http://a2rms.sourceforge.net IA2RMS] is a Matlab code of the ''Independent Doubly Adaptive Rejection Metropolis Sampling'' method for drawing from the full-conditional densities within a Gibbs sampler.
* [https://github.com/kirill77/SimpleMetropolisCheck unbiased Metropolis sampling] Simple Visual C++ project which showcases numerical integration using Metropolis sampling without burn-in samples and without bias. Uses idea from Ph.D. thesis of Eric Veach "ROBUST MONTE CARLO METHODS FOR LIGHT TRANSPORT SIMULATION"

{{DEFAULTSORT:Metropolis-Hastings Algorithm}}