Metropolis–Hastings algorithm
# Initialization: Choose an arbitrary point <math>x_t</math> to be the first sample and choose an arbitrary probability density <math>g(x|y)</math> (sometimes written <math>Q(x|y)</math>) that suggests a candidate for the next sample value <math>x</math>, given the previous sample value <math>y</math>. In this section, <math>g</math> is assumed to be symmetric; in other words, it must satisfy <math>g(x|y) = g(y|x)</math>. A usual choice is to let <math>g(x|y)</math> be a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited next, making the sequence of samples into a [[random walk]]. The function <math>g</math> is referred to as the ''proposal density'' or ''jumping distribution''.
# For each iteration ''t'':
#* ''Generate'' a candidate <math>x'</math> for the next sample by picking from the distribution <math>g(x' \mid x_t)</math>.
#* ''Calculate'' the ''acceptance ratio'' <math>\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate. Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>.
#* ''Accept or reject'':
#** Generate a uniform random number <math>u \in [0, 1]</math>.
#** If <math>u \le \alpha</math>, ''accept'' the candidate by setting <math>x_{t+1} = x'</math>.
#** If <math>u > \alpha</math>, ''reject'' the candidate and set <math>x_{t+1} = x_t</math> instead.
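The iteration above can be sketched in Python. This is a minimal illustration, not a reference implementation: the unnormalized Gaussian target, the starting point, and the proposal width are all hypothetical choices, and the proposal is symmetric so the ratio of proposal densities cancels.

```python
import math
import random

def metropolis(f, x0, sigma, n_samples):
    """Metropolis sampling with a symmetric Gaussian proposal.

    f      : unnormalized target density (proportional to P)
    x0     : arbitrary starting point
    sigma  : proposal standard deviation (a tuning parameter)
    """
    x = x0
    samples = [x]
    for _ in range(n_samples):
        # Propose from g(x'|x) = Normal(x, sigma); g is symmetric, so it cancels.
        x_new = random.gauss(x, sigma)
        # Acceptance ratio alpha = f(x')/f(x) = P(x')/P(x).
        alpha = f(x_new) / f(x)
        # Accept with probability min(1, alpha); this comparison does exactly that.
        if random.random() < alpha:
            x = x_new
        samples.append(x)
    return samples

# Hypothetical example: sample from an unnormalized standard normal density.
random.seed(0)
target = lambda x: math.exp(-0.5 * x * x)
chain = metropolis(target, x0=0.0, sigma=1.0, n_samples=10000)
```

The sample mean and variance of `chain` should approximate those of the standard normal (0 and 1), up to Monte Carlo error.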
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution <math>P(x)</math>. To accomplish this, the algorithm uses a [[Markov process]], which asymptotically reaches a unique [[Markov chain#Steady-state analysis and limiting distributions|stationary distribution]] <math>\pi(x)</math> such that <math>\pi(x) = P(x)</math>.<ref name=Roberts_Casella/>
 
A Markov process is uniquely defined by its transition probabilities <math>P(x' \mid x)</math>, the probability of transitioning from any given state <math>x</math> to any other given state <math>x'</math>. It has a unique stationary distribution <math>\pi(x)</math> when the following two conditions are met:<ref name=Roberts_Casella/>
# ''Existence of stationary distribution'': there must exist a stationary distribution <math>\pi(x)</math>. A sufficient but not necessary condition is [[Markov chain#Reversible Markov chain|detailed balance]], which requires that each transition <math>x \to x'</math> is reversible: for every pair of states <math>x, x'</math>, the probability of being in state <math>x</math> and transitioning to state <math>x'</math> must be equal to the probability of being in state <math>x'</math> and transitioning to state <math>x</math>, <math>\pi(x) P(x' \mid x) = \pi(x') P(x \mid x')</math>.
# ''Uniqueness of stationary distribution'': the stationary distribution <math>\pi(x)</math> must be unique. This is guaranteed by [[Markov Chain#Ergodicity|ergodicity]] of the Markov process, which requires that every state must (1) be aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive recurrent—the expected number of steps for returning to the same state is finite.
 
The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition probabilities) that fulfills the two above conditions, such that its stationary distribution <math>\pi(x)</math> is chosen to be <math>P(x)</math>. The derivation of the algorithm starts with the condition of detailed balance:
 
: <math>P(x' \mid x) P(x) = P(x \mid x') P(x'),</math>
 
which is re-written as
 
: <math>\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.</math>
 
The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The ''proposal distribution'' <math>g(x' \mid x)</math> is the conditional probability of proposing a state <math>x'</math> given <math>x</math>, and the ''acceptance ratio'' <math>A(x', x)</math> is the probability to accept the proposed state <math>x'</math>. The transition probability can be written as their product:
 
: <math>P(x' \mid x) = g(x' \mid x) A(x', x).</math>
 
Inserting this relation in the previous equation, we have
 
: <math>\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\frac{g(x \mid x')}{g(x' \mid x)}.</math>
 
The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:
 
: <math>A(x', x) = \min\left(1, \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)}\right).</math>
 
For this Metropolis acceptance ratio <math>A</math>, either <math>A(x', x) = 1</math> or <math>A(x, x') = 1</math> and, either way, the condition is satisfied.
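This property can be checked numerically on a small discrete chain: with the Metropolis choice of <math>A</math>, the resulting transition probabilities satisfy detailed balance for any proposal. The three-state target and proposal matrix below are arbitrary illustrative values, not anything from the article.

```python
import itertools

# Hypothetical three-state target distribution pi = P and proposal matrix g,
# where g[x][x_new] is the probability of proposing state x_new from state x.
P = [0.2, 0.5, 0.3]
g = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.25, 0.25]]

def A(x_new, x):
    """Metropolis acceptance ratio min(1, P(x')g(x|x') / (P(x)g(x'|x)))."""
    return min(1.0, P[x_new] * g[x_new][x] / (P[x] * g[x][x_new]))

def T(x_new, x):
    """Off-diagonal transition probability: propose x_new, then accept it."""
    return g[x][x_new] * A(x_new, x)

# Detailed balance: pi(x) T(x'|x) == pi(x') T(x|x') for every pair of states.
for x, x_new in itertools.permutations(range(3), 2):
    lhs = P[x] * T(x_new, x)
    rhs = P[x_new] * T(x, x_new)
    assert abs(lhs - rhs) < 1e-12
```

The check passes because, for any pair, one direction has acceptance probability 1 while the other carries exactly the compensating ratio.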
## Set <math>t = 0</math>.
# Iterate
## ''Generate'' a random candidate state <math>x'</math> according to <math>g(x' \mid x_t)</math>.
## ''Calculate'' the acceptance probability <math>A(x', x_t) = \min\left(1, \frac{P(x')}{P(x_t)} \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)</math>.
## ''Accept or reject'':
### generate a uniform random number <math>u \in [0, 1]</math>;
### if <math>u \le A(x', x_t)</math>, ''accept'' the candidate by setting <math>x_{t+1} = x'</math>;
### if <math>u > A(x', x_t)</math>, ''reject'' the candidate and set <math>x_{t+1} = x_t</math> instead.
Provided that specified conditions are met, the empirical distribution of saved states <math>x_0, \ldots, x_T</math> will approach <math>P(x)</math>. The number of iterations (<math>T</math>) required to effectively estimate <math>P(x)</math> depends on a number of factors, including the relationship between <math>P(x)</math> and the proposal distribution and the desired accuracy of estimation.<ref>Raftery, Adrian E., and Steven Lewis. "How Many Iterations in the Gibbs Sampler?" ''In Bayesian Statistics 4''. 1992.</ref> For distributions on discrete state spaces, it has to be of the order of the [[autocorrelation]] time of the Markov process.<ref name=Newman_Barkema/>
 
It is important to notice that, in a general problem, it is not clear which distribution <math>g(x' \mid x)</math> one should use or how many iterations are necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem at hand.
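The general (asymmetric-proposal) algorithm above can be sketched compactly in log space, which avoids overflow for small densities. Everything concrete here is an assumption for illustration: the unnormalized Exponential(1) target, the multiplicative log-normal random-walk proposal, and the chain length.

```python
import math
import random

def metropolis_hastings(log_f, propose, log_g, x0, n_samples):
    """One generic MH chain.

    Accepts x' with probability min(1, f(x') g(x|x') / (f(x) g(x'|x))),
    computed in log space. log_g(a, b) is log g(a | b), up to constants.
    """
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = propose(x)
        log_alpha = (log_f(x_new) - log_f(x)
                     + log_g(x, x_new) - log_g(x_new, x))
        if math.log(random.random()) < log_alpha:
            x = x_new
        samples.append(x)
    return samples

# Hypothetical example: unnormalized Exponential(1) target on x > 0, with
# an asymmetric multiplicative (log-normal) random-walk proposal, so the
# Hastings correction g(x|x')/g(x'|x) matters.
s = 0.5
log_f = lambda x: -x                                   # log f(x) for x > 0
propose = lambda x: x * math.exp(random.gauss(0, s))   # x' = x * e^N(0, s)
# log of the log-normal density of a centered at log b (constants dropped)
log_g = lambda a, b: (-(math.log(a) - math.log(b)) ** 2 / (2 * s * s)
                      - math.log(a))

random.seed(1)
chain = metropolis_hastings(log_f, propose, log_g, x0=1.0, n_samples=20000)
```

The sample mean of `chain` should approach 1, the mean of Exponential(1), up to Monte Carlo error.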
 
== Use in numerical integration ==
 
: <math>
P(E) = \int_\Omega P(E \mid x) P(x) \,dx = \int_\Omega \delta\big(E - E(x)\big) P(x) \,dx = E_X\big(P(E \mid X)\big)
</math>
and, thus, estimating <math>P(E)</math> can be accomplished by estimating the expected value of the [[indicator function]] <math>A_E(x) \equiv \mathbf{1}_E(x)</math>, which is 1 when <math>E(x) \in [E, E + \Delta E]</math> and zero otherwise.
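Concretely, the estimate is just the sample mean of the indicator over draws from <math>P(x)</math>. In the sketch below the target is a standard normal and the "energy" is <math>E(x) = x^2</math>, both hypothetical choices; the draws are i.i.d. for simplicity, where in practice they would come from a Metropolis–Hastings chain.

```python
import math
import random

# Estimate P(E(x) in [E0, E0 + dE]) as the sample mean of the indicator
# 1_E(x) over draws from P(x). Target and energy are illustrative.
random.seed(2)
E_fn = lambda x: x * x                  # hypothetical energy function
E0, dE = 0.0, 1.0                       # interval [0, 1]
indicator = lambda x: 1.0 if E0 <= E_fn(x) <= E0 + dE else 0.0

draws = [random.gauss(0, 1) for _ in range(100000)]
p_hat = sum(indicator(x) for x in draws) / len(draws)

# The exact value here is P(x^2 <= 1) = P(-1 <= x <= 1) = erf(1/sqrt(2)).
p_true = math.erf(1 / math.sqrt(2))
```

`p_hat` should agree with `p_true` (about 0.683) to within Monte Carlo error.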
==Step-by-step instructions==
 
Suppose that the most recent value sampled is <math>x_t</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'</math> with probability density <math>g(x' \mid x_t)</math> and calculate a value
 
: <math>a = a_1 a_2,</math>
is the probability (e.g., Bayesian posterior) ratio between the proposed sample <math>x'</math> and the previous sample <math>x_t</math>, and
 
: <math>a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}</math>
 
is the ratio of the proposal density in two directions (from <math>x_t</math> to <math>x'</math> and conversely).
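The two factors can be computed explicitly for a single proposed move. The unnormalized Gaussian target, the Gaussian proposal, and the particular values of <math>x_t</math> and <math>x'</math> below are all hypothetical; with this symmetric proposal, <math>a_2 = 1</math> exactly.

```python
import math

# One MH decision value a = a1 * a2 for a single proposed move.
f = lambda x: math.exp(-0.5 * x * x)    # unnormalized target, f ∝ P

def g(a, b, sigma=1.0):
    """Proposal density g(a | b): Normal pdf with mean b and std sigma."""
    return (math.exp(-0.5 * ((a - b) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

x_t, x_new = 0.5, 1.0                   # hypothetical current and proposed states
a1 = f(x_new) / f(x_t)                  # target ratio P(x') / P(x_t)
a2 = g(x_t, x_new) / g(x_new, x_t)      # proposal ratio; = 1 for symmetric g
a = a1 * a2
```

Here <math>a_1 = e^{-0.375} \approx 0.687</math> and <math>a_2 = 1</math>, so the move is accepted with probability about 0.687.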
Then the new state <math>x_{t+1}</math> is chosen according to the following rules.
 
: If <math>a \geq 1</math>:
:: <math>x_{t+1} = x',</math>
: else:
:: <math>x_{t+1} = \begin{cases} x' & \text{with probability } a, \\ x_t & \text{with probability } 1 - a. \end{cases}</math>
These samples, which are discarded, are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represent a [[Sample (statistics)|sample]] from the distribution <math>P(x)</math>.
 
The algorithm works best if the proposal density matches the shape of the target distribution <math>P(x)</math>, from which direct sampling is difficult; that is, <math>g(x' \mid x_t) \approx P(x')</math>.
If a Gaussian proposal density <math>g</math> is used, the variance parameter <math>\sigma^2</math> has to be tuned during the burn-in period.
This is usually done by calculating the ''acceptance rate'', which is the fraction of proposed samples that are accepted in a window of the last <math>N</math> samples.
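One simple way to carry out this tuning is to adjust <math>\sigma</math> after each window of proposals, nudging it toward a target acceptance rate. The sketch below is one such heuristic, not a prescribed method: the Gaussian target, the window size, the 0.234 target rate (a common rule of thumb for random-walk proposals in high dimension), and the multiplicative update factors are all assumptions.

```python
import math
import random

def tune_sigma(f, x0, sigma, n_burn=5000, window=100, target_rate=0.234):
    """Run a burn-in Metropolis chain, adapting sigma from the acceptance rate.

    After every `window` proposals, widen the steps if accepting too often
    (rate above target_rate) and shrink them if accepting too rarely.
    """
    x, accepted = x0, 0
    for i in range(1, n_burn + 1):
        x_new = random.gauss(x, sigma)
        if random.random() < min(1.0, f(x_new) / f(x)):
            x, accepted = x_new, accepted + 1
        if i % window == 0:
            rate = accepted / window
            sigma *= 1.1 if rate > target_rate else 0.9
            accepted = 0
    return x, sigma

random.seed(3)
f = lambda x: math.exp(-0.5 * x * x)    # hypothetical target ∝ N(0, 1)
x_end, sigma_tuned = tune_sigma(f, x0=0.0, sigma=10.0)
```

Starting from a deliberately too-wide <math>\sigma = 10</math>, the acceptance rate is low at first, so the adaptation shrinks <math>\sigma</math> toward a value whose acceptance rate hovers near the target. Adaptation is typically frozen after burn-in, since continually adapting proposals can bias the stationary distribution.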