Content deleted Content added
m Expectation–maximization w/hyphen including in section title |
Link suggestions feature: 2 links added. Tags: Visual edit Mobile edit Mobile web edit Newcomer task Suggested: add links |
||
(7 intermediate revisions by 5 users not shown) | |||
Line 9:
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to [[Monte Carlo sampling]] methods—particularly, [[Markov chain Monte Carlo]] methods such as [[Gibbs sampling]]—for taking a fully Bayesian approach to [[statistical inference]] over complex [[probability distribution|distributions]] that are difficult to evaluate directly or [[sample (statistics)|sample]]. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the [[expectation–maximization algorithm|expectation–maximization]] (EM) algorithm from [[maximum likelihood estimation|maximum likelihood]] (ML) or [[maximum a posteriori estimation|maximum a posteriori]] (MAP) estimation
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
Line 43:
:<math>
\begin{array}{rl}
D_{\mathrm{KL}}(Q \parallel P)
&= \sum_\mathbf{Z} Q(\mathbf{Z}) \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} + \log P(\mathbf{X}) \right]\\
&= \sum_\mathbf{Z} Q(\mathbf{Z}) \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] + \sum_\mathbf{Z} Q(\mathbf{Z}) \left[ \log P(\mathbf{X}) \right]
\end{array}
</math>
Line 62 ⟶ 64:
which can be rearranged to become
:<math>
\begin{array}{rl}
\log P(\mathbf{X}) =▼
\log P(\mathbf{X}) &=
D_{\mathrm{KL}}(Q \parallel P) - \mathbb{E}_{\mathbf Q } \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] \\ &= D_{\mathrm{KL}}(Q\parallel P) + \mathcal{L}(Q) \end{array}
</math>
Line 71 ⟶ 76:
===Proofs===
By the generalized [[Pythagorean theorem]] of [[Bregman divergence]], of which KL-divergence is a special case, it can be shown that:<ref name=Tran2018>{{cite arXiv|title=Copula Variational Bayes inference via information geometry|first1=Viet Hung|last1=Tran|year=2018|eprint=1803.10998|class=cs.IT}}</ref><ref name="Martin2014"/>
[[File:Bregman_divergence_Pythagorean.png|right|300px|thumb|Generalized Pythagorean theorem for [[Bregman divergence]]<ref name="Martin2014">{{cite journal |last1=Adamčík |first1=Martin |title=The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning |journal=Entropy |date=2014 |volume=16 |issue=12 |pages=6338–6381|bibcode=2014Entrp..16.6338A |doi=10.3390/e16126338 |doi-access=free }}</ref>]]
:<math>
Line 77 ⟶ 82:
</math>
where <math>\mathcal{C}</math> is a [[convex set]] and the equality holds if:
:<math> Q = Q^{*} \triangleq \arg\min_{Q\in\mathcal{C}}D_{\mathrm{KL}}(Q\parallel P). </math>
Line 83 ⟶ 88:
In this case, the global minimizer <math>Q^{*}(\mathbf{Z}) = q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)q^{*}(\mathbf{Z}_2) = q^{*}(\mathbf{Z}_2\mid\mathbf{Z}_1)q^{*}(\mathbf{Z}_1),</math> with <math>\mathbf{Z}=\{\mathbf{Z_1},\mathbf{Z_2}\},</math> can be found as follows:<ref name=Tran2018/>
:<math>
\begin{array}{rl}
= \frac{P(\mathbf{X})}{\zeta(\mathbf{X})}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))} ▼
q^{*}(\mathbf{Z}_2)
= \frac{1}{\zeta(\mathbf{X})}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right),</math>▼
▲&= \frac{P(\mathbf{X})}{\zeta(\mathbf{X})}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))} \\
▲&= \frac{1}{\zeta(\mathbf{X})}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right),
\end{array}</math>
in which the normalizing constant is:
:<math>
\begin{array}{rl}
=P(\mathbf{X})\int_{\mathbf{Z}_2}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))}▼
= \int_{\mathbf{Z}_{2}}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right).</math>▼
▲&=P(\mathbf{X})\int_{\mathbf{Z}_2}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))} \\
▲&= \int_{\mathbf{Z}_{2}}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right).
\end{array}</math>
The term <math>\zeta(\mathbf{X})</math> is often called the [[model evidence|evidence]] lower bound ('''ELBO''') in practice, since <math>P(\mathbf{X})\geq\zeta(\mathbf{X})=\exp(\mathcal{L}(Q^{*}))</math>,<ref name=Tran2018/> as shown above.
Line 116 ⟶ 127:
The constant in the above expression is related to the [[normalizing constant]] (the denominator in the expression above for <math>q_j^{*}</math>) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. [[Gaussian distribution|Gaussian]], [[gamma distribution|gamma]], etc.).
Using the properties of expectations, the expression <math>\operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})]</math> can usually be simplified into a function of the fixed [[Hyperparameter (Bayesian statistics)|hyperparameter]]s of the [[prior distribution]]s over the latent variables and of expectations (and sometimes higher [[moment (mathematics)|moment]]s such as the [[variance]]) of latent variables not in the current partition (i.e. latent variables not included in <math>\mathbf{Z}_j</math>). This creates [[circular dependency|circular dependencies]] between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an [[iterative]] algorithm, much like EM (the [[expectation–maximization algorithm]]), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to [[limit of a sequence|converge]].<ref>{{cite book|title=Convex Optimization|first1=Stephen P.|last1=Boyd|first2=Lieven|last2=Vandenberghe|year=2004|publisher=Cambridge University Press|isbn=978-0-521-83378-3|url=https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf|access-date=October 15, 2011}}</ref>
In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's [[functional dependency]] on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the [[mean]]s); sometimes expectations of squared variables (which can be related to the [[variance]] of the variables), or expectations of higher powers (i.e. higher [[moment (mathematics)|moment]]s) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, [[nonlinear]] dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
==A duality formula for variational inference==
Line 129 ⟶ 140:
:<math> \log E_P[\exp h] = \text{sup}_{Q \ll P} \{ E_Q[h] - D_\text{KL}(Q \parallel P)\}.</math>
Further, the supremum on the right-hand side is attained [[if and only if]] it holds
:<math> \frac{q(\theta)}{p(\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]},</math>
Line 152 ⟶ 163:
</math>
The [[Hyperparameter (Bayesian statistics)|hyperparameter]]s <math>\mu_0, \lambda_0, a_0</math> and <math>b_0</math> in the prior distributions are fixed, given values. They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions of <math>\mu</math> and <math>\tau</math>.
We are given <math>N</math> data points <math>\mathbf{X} = \{x_1, \ldots, x_N\}</math> and our goal is to infer the [[posterior distribution]] <math>q(\mu, \tau)=p(\mu,\tau\mid x_1, \ldots, x_N)</math> of the parameters <math>\mu</math> and <math>\tau.</math>
Line 199 ⟶ 210:
</math>
In the above derivation, <math>C</math>, <math>C_2</math> and <math>C_3</math> refer to values that are constant with respect to <math>\mu</math>. Note that the term <math>\operatorname{E}_{\tau}[\ln p(\tau)]</math> is not a function of <math>\mu</math> and will have the same value regardless of the value of <math>\mu</math>. Hence in line 3 we can absorb it into the [[constant term]] at the end. We do the same thing in line 7.
The last line is simply a quadratic polynomial in <math>\mu</math>. Since this is the logarithm of <math>q_\mu^*(\mu)</math>, we can see that <math>q_\mu^*(\mu)</math> itself is a [[Gaussian distribution]].
Line 328 ⟶ 339:
# For a given partition <math>\mathbf{Z}_j</math>, write down the formula for the best approximating distribution <math>q_j^{*}(\mathbf{Z}_j\mid \mathbf{X})</math> using the basic equation <math>\ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}</math> .
# Fill in the formula for the [[joint probability distribution]] using the graphical model. Any component conditional distributions that don't involve any of the variables in <math>\mathbf{Z}_j</math> can be ignored; they will be folded into the constant term.
# Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in <math>\mathbf{Z}_j</math> (e.g. first or second raw [[moment (mathematics)|moment]]s, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or [[Hyperparameter (Bayesian statistics)|hyperparameter]]s of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
# The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the [[probability density function]] (PDF) of the distribution (or at least, something proportional to it, with unknown [[normalization constant]]). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
# When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
Line 345 ⟶ 356:
However, there are a number of differences. Most important is ''what'' is being computed.
* EM computes point estimates of posterior distribution of those random variables that can be categorized as "parameters", but only estimates of the actual posterior distributions of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the [[mode (statistics)|mode]]s of these parameters; no other information is available.
* VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the [[mean]] is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, the parameters computed in VB do ''not'' have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian [[mixture model]] will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters — typically in the form of a [[prior distribution]], e.g. a [[normal-scaled inverse gamma distribution]] — and would then compute values for the parameters of this prior distribution, i.e. essentially [[Hyperparameter (Bayesian statistics)|hyperparameters]]. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.
{{clear}}
Line 529 ⟶ 540:
==External links==
* [https://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms] {{Webarchive|url=https://web.archive.org/web/20170512025952/https://www.inference.phy.cam.ac.uk/mackay/itila/ |date=2017-05-12 }}, by [[David J.C. MacKay]] provides an introduction to variational methods (p. 422).
* [https://www.robots.ox.ac.uk/~sjrob/Pubs/fox_vbtut.pdf A Tutorial on Variational Bayes]. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review, {{doi|10.1007/s10462-011-9236-8}}.
* [https://www.gatsby.ucl.ac.uk/vbayes/ Variational-Bayes Repository] A repository of research papers, software, and links related to the use of variational methods for approximate Bayesian learning up to 2003.
|