Metropolis–Hastings algorithm

This is an old revision of this page, as edited by Cburnett (talk | contribs) at 18:18, 25 March 2005 (How was this not in Category:Algorithms; copyeditting; changed pdf to lowercase; etc.).

In mathematics and physics, the Metropolis–Hastings algorithm is a method for generating a sequence of samples from the joint distribution of two or more variables. The purpose of such a sequence is to approximate the joint distribution (as with a histogram), or to compute an integral (such as an expected value). The algorithm is an example of a Markov chain Monte Carlo method. It is a generalization, suggested by Hastings (citation below), of the Metropolis algorithm. The Gibbs sampling algorithm is a special case of the Metropolis–Hastings algorithm.

[Figure: The sampling (proposal) distribution determines the next point to move to in the random walk.]

The Metropolis–Hastings algorithm can draw samples from any probability distribution P(x), requiring only that the density can be calculated at x. The algorithm generates a set of states x^t which is a Markov chain because each state x^t depends only on the previous state x^(t-1). The algorithm depends on the creation of a proposal density Q(x'; x^t), which depends on the current state x^t and which can generate a new proposed sample x'. For example, the proposal density could be a Gaussian function centred on the current state x^t,

    Q(x'; x^t) ~ N(x^t, σ²I),

reading Q(x'; x^t) as the probability of generating x' given the previous value x^t.

This proposal density would generate samples centred around the current state x^t with variance σ²I. So we draw a new proposal state x' with probability Q(x'; x^t) and then calculate a value

    a = a1 a2

where

    a1 = P(x') / P(x^t)

is the likelihood ratio between the proposed sample x' and the previous sample x^t, and

    a2 = Q(x^t; x') / Q(x'; x^t)

is the ratio of the proposal density in two directions (from x^t to x' and vice versa). This is equal to 1 if the proposal density is symmetric. Then the new state x^(t+1) is chosen with the rule

    if a ≥ 1:    x^(t+1) = x'
    if a < 1:    x^(t+1) = x' with probability a, and x^(t+1) = x^t otherwise.

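The sampling procedure above can be sketched in Python with a symmetric Gaussian proposal, for which a2 = 1 and a reduces to the likelihood ratio a1. The target density and parameter values here are illustrative assumptions, not part of the original article:

```python
import math
import random

def metropolis_hastings(p, x0, sigma, n_samples, seed=0):
    """Draw n_samples states from the (possibly unnormalised) density p,
    using a symmetric Gaussian proposal Q(x'; x^t) = N(x^t, sigma^2)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = rng.gauss(x, sigma)       # draw x' from the proposal density
        a = p(x_new) / p(x)               # a = a1; a2 = 1 (symmetric proposal)
        if a >= 1 or rng.random() < a:    # accept with probability min(1, a)
            x = x_new
        samples.append(x)                 # on rejection, the old state repeats

    return samples

# Illustrative target: an unnormalised standard normal density.
target = lambda x: math.exp(-0.5 * x * x)
chain = metropolis_hastings(target, x0=0.0, sigma=1.0, n_samples=20000)
kept = chain[2000:]   # discard the first samples as burn-in
```

Note that rejected proposals still produce a sample (the repeated old state); dropping them instead would bias the chain away from high-density regions.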
The Markov chain is started from a random initial value x^0 and the algorithm is run for a few thousand iterations so that this initial state is "forgotten". These early samples, which are discarded, are known as burn-in. The algorithm works best if the proposal density matches the shape of the target distribution P(x), that is Q(x'; x^t) ≈ P(x'), but in most cases this is unknown. If a Gaussian proposal density is used, the variance parameter σ² has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that was accepted in a window of the last N samples. The acceptance rate is usually set to be around 60%. If the proposal steps are too small, the chain will mix slowly (i.e., it will move around the space slowly and converge slowly to P(x)). If the proposal steps are too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so a1 will be very small.
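The tuning described above can be sketched as a burn-in loop that tracks the acceptance rate over a sliding window and nudges σ toward the ~60% target. This is a minimal illustrative sketch, assuming a standard-normal target and simple multiplicative adjustments; the function name, window size, and adjustment factors are all assumptions:

```python
import math
import random

def tune_sigma(p, x0, sigma, n_burn=5000, window=200, target_rate=0.6, seed=1):
    """Hypothetical burn-in loop: run Metropolis-Hastings steps, measure the
    acceptance rate over the last `window` proposals, and adjust sigma."""
    rng = random.Random(seed)
    x = x0
    accepts = []
    for _ in range(n_burn):
        x_new = rng.gauss(x, sigma)
        a = p(x_new) / p(x)
        accepted = a >= 1 or rng.random() < a
        if accepted:
            x = x_new
        accepts.append(accepted)
        if len(accepts) >= window:
            rate = sum(accepts[-window:]) / window
            # Too many acceptances: steps are too timid, so widen them.
            # Too few: steps overshoot into low-density regions, so shrink them.
            sigma *= 1.01 if rate > target_rate else 0.99
    return x, sigma

target = lambda x: math.exp(-0.5 * x * x)
x_start, sigma_tuned = tune_sigma(target, x0=0.0, sigma=10.0)
```

Starting from a deliberately oversized σ = 10, the low acceptance rate drives σ down until the rate hovers near the target; sampling proper would then begin from x_start with sigma_tuned held fixed.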

References

  • Chib, Siddhartha and Edward Greenberg. "Understanding the Metropolis–Hastings Algorithm". American Statistician, 49:327–335, 1995.
  • W. K. Hastings. "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika, 57:97–109, 1970.
  • N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. "Equations of State Calculations by Fast Computing Machines". Journal of Chemical Physics, 21:1087–1091, 1953.