Revision as of 08:49, 24 November 2019 edit Mikhail Ryazanov (talk \| contribs) Extended confirmed users 24,662 edits m →History: style, punct., wiki ← Previous edit		Revision as of 09:02, 24 November 2019 edit undo Mikhail Ryazanov (talk \| contribs) Extended confirmed users 24,662 edits m →Intuition: style, punct., fmt. Next edit →
Line 26: Let <math>f(x)</math> be a function that is proportional to the desired probability distribution <math>P(x)</math> (a.k.a. a target distribution). # Initialization: Choose an arbitrary point <math>x_t</math> to be the first sample, and choose an arbitrary probability density <math>g(x\|y)</math> (sometimes written <math>Q(x\|y)</math>) that suggests a candidate for the next sample value <math>x</math>, given the previous sample value <math>y</math>. For the Metropolis algorithm, <math>g</math> must be symmetric; in other words, it must satisfy <math>g(x\|y) = g(y\|x)</math>. A usual choice is to let <math>g(x\|y)</math> be a [[Gaussian distribution]] centered at <math>y</math>, so that points closer to <math>y</math> are more likely to be visited ~~next—making~~next, making the sequence of samples into a [[random walk]]. The function <math>g</math> is referred to as the ''proposal density'' or ''jumping distribution''. # For each iteration ''t'': #* '''Generate''~~' : Generate~~ a candidate <math>x'</math> for the next sample by picking from the distribution <math>g(x'\|x_t)</math>. #* '''Calculate''' ~~: Calculate~~ the ''acceptance ratio'' <math ~~display="inline"~~>\alpha = f(x')/f(x_t)</math>, which will be used to decide whether to accept or reject the candidate. Because ''f'' is proportional to the density of ''P'', we have that <math>\alpha = f(x')/f(x_t) = P(x')/P(x_t)</math>. #* '''Accept or ~~Reject~~reject''' : # Generate a uniform random number <math>u~~</math>~~ on\in [0, 1]</math>. # If <math>u \le \alpha</math>, then ''accept'' the candidate by setting <math>x_{t+1} = x'</math>, #** If <math>u > \alpha</math>, then ''reject'' the candidate and set <math>x_{t+1} = x_t</math>, instead. This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio <math>\alpha</math> indicates how probable the new proposed sample is with respect to the current sample, according to the distribution <math>~~\displaystyle~~ P(x)</math>. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of <math>~~\displaystyle~~ P(x)</math>), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the more the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of <math>~~\displaystyle~~ P(x)</math>, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works, and returns samples that follow the desired distribution <math>~~\displaystyle~~ P(x)</math>. Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal \|title = Adaptive Rejection Sampling for Gibbs Sampling \|jstor = 2347565 \|journal = Journal of the Royal Statistical Society. Series C (Applied Statistics) \|date = 1992-01-01 \|pages = 337–348 \|volume = 41 \|issue = 2 \|doi = 10.2307/2347565 \|first = W. R. \|last = Gilks \|first2 = P. \|last2 = Wild}}</ref> that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:▼ * The samples are correlated. Even though over the long term they do correctly follow <math>~~\displaystyle~~ P(x)</math>, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every ''n''th sample, for some value of ''n'' (typically determined by examining the [[autocorrelation]] between adjacent samples). Autocorrelation can be reduced by increasing the ''jumping width'' (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a ''slow-mixing'' Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution.▼ * Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a ''burn-in'' period is typically necessary,<ref>{{Cite book \|title=Bayesian data analysis \|date=2004 \|publisher=Chapman & Hall / CRC \|others=Gelman, Andrew. \|isbn=978-1584883883 \|edition= 2nd \|___location=Boca Raton, Fla. \|oclc=51991499}}</ref> where an initial number of samples (e.g. the first ~~1,000~~1000 or so) are thrown away.▼ ▲Compared with an algorithm like [[adaptive rejection sampling]]<ref name=":0">{{Cite journal\|title = Adaptive Rejection Sampling for Gibbs Sampling\|jstor = 2347565\|journal = Journal of the Royal Statistical Society. Series C (Applied Statistics)\|date = 1992-01-01\|pages = 337–348\|volume = 41\|issue = 2\|doi = 10.2307/2347565\|first = W. R.\|last = Gilks\|first2 = P.\|last2 = Wild}}</ref> that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages: ▲The samples are correlated. Even though over the long term they do correctly follow <math>\displaystyle P(x)</math>, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every ''n''th sample, for some value of ''n'' (typically determined by examining the [[autocorrelation]] between adjacent samples). Autocorrelation can be reduced by increasing the ''jumping width'' (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a ''slow-mixing'' Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution. ▲Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a ''burn-in'' period is typically necessary,<ref>{{Cite book\|title=Bayesian data analysis\|date=2004\|publisher=Chapman & Hall/CRC\|others=Gelman, Andrew.\|isbn=978-1584883883\|edition= 2nd\|___location=Boca Raton, Fla.\|oclc=51991499}}</ref> where an initial number of samples (e.g. the first 1,000 or so) are thrown away. On the other hand, most simple [[rejection sampling]] methods suffer from the "[[curse of dimensionality]]", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from [[hierarchical Bayesian model]]s and other high-dimensional statistical models used nowadays in many disciplines. In [[multivariate distribution\|multivariate]] distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the ~~right~~suitable jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as [[Gibbs sampling]], involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. This is especially applicable when the multivariate distribution is composed of a set of individual [[random variable]]s in which each variable is conditioned on only a small number of other variables, as is the case in most typical [[hierarchical Bayesian model\|hierarchical ~~model~~models]]s. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the [[adaptive rejection sampling]] methods,<ref name=":0" /><ref>{{Cite journal \|title = Concave-Convex Adaptive Rejection Sampling \|journal = Journal of Computational and Graphical Statistics \|date = 2011-01-01 \|issn = 1061-8600 \|pages = 670–691 \|volume = 20 \|issue = 3 \|doi = 10.1198/jcgs.2011.09058 \|first = Dilan \|last = Görür \|first2 = Yee Whye \|last2 = Teh}}</ref><ref>{{Cite journal \|title = A Rejection Technique for Sampling from T-concave Distributions \|journal = ACM Trans. Math. Softw. \|date = 1995-06-01 \|issn = 0098-3500 \|pages = 182–193 \|volume = 21 \|issue = 2 \|doi = 10.1145/203082.203089 \|first = Wolfgang \|last = Hörmann\|citeseerx = 10.1.1.56.6055}}</ref><ref>{{Cite journal\|title = A generalization of the adaptive rejection sampling algorithm \|journal = Statistics and Computing \|date = 2010-08-25 \|issn = 0960-3174 \|pages = 633–647 \|volume = 21 \|issue = 4 \|doi = 10.1007/s11222-010-9197-9 \|first = Luca \|last = Martino \|first2 = Joaquín \|last2 = Míguez \|hdl = 10016/16624}}</ref> the adaptive rejection Metropolis sampling algorithm<ref>{{Cite journal \|title = Adaptive Rejection Metropolis Sampling within Gibbs Sampling \|jstor = 2986138 \|journal = Journal of the Royal Statistical Society. Series C (Applied Statistics) \|date = 1995-01-01 \|pages = 455–472 \|volume = 44 \|issue = 4 \|doi = 10.2307/2986138 \|first = W. R. \|last = Gilks \|first2 = N. G. \|last2 = Best \|author2-link= Nicky Best \|first3 = K. K. C. \|last3 = Tan}}</ref> or its improvements<ref>{{Cite journal \|title = Independent Doubly Adaptive Rejection Metropolis Sampling Within Gibbs Sampling \|journal = IEEE Transactions on Signal Processing \|date = 2015-06-01 \|issn = 1053-587X \|pages = 3123–3138 \|volume = 63 \|issue = 12 \|doi = 10.1109/TSP.2015.2420537 \|first = L. \|last = Martino \|first2 = J. \|last2 = Read \|first3 = D. \|last3 = Luengo \|arxiv = 1205.5494 \|bibcode = 2015ITSP...63.3123M}}</ref><ref>{{Cite journal \|title = Adaptive rejection Metropolis sampling using Lagrange interpolation polynomials of degree 2 \|journal = Computational Statistics & Data Analysis \|date = 2008-03-15 \|pages = 3408–3423 \|volume = 52 \|issue = 7 \|doi = 10.1016/j.csda.2008.01.005 \|first = Renate \|last = Meyer \|first2 = Bo \|last2 = Cai \|first3 = François \|last3 = Perron}}</ref> (see [http://a2rms.sourceforge.net matlab code]), a simple one-dimensional Metropolis–Hastings step, or [[slice sampling]]. ==Formal derivation==

Metropolis–Hastings algorithm: Difference between revisions