As the ''log-[[model evidence|evidence]]'' <math>\log P(\mathbf{X})</math> is fixed with respect to <math>Q</math>, maximizing the final term <math>\mathcal{L}(Q)</math> minimizes the KL divergence of <math>Q</math> from <math>P</math>. By appropriate choice of <math>Q</math>, <math>\mathcal{L}(Q)</math> becomes tractable to compute and to maximize. Hence we have both an analytical approximation <math>Q</math> for the posterior <math>P(\mathbf{Z}\mid \mathbf{X})</math>, and a lower bound <math>\mathcal{L}(Q)</math> for the log-evidence <math>\log P(\mathbf{X})</math> (since the KL-divergence is non-negative).
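In symbols, the relationship between the three quantities is
:<math>\log P(\mathbf{X}) = \mathcal{L}(Q) + D_{\mathrm{KL}}\big(Q(\mathbf{Z}) \parallel P(\mathbf{Z}\mid\mathbf{X})\big) \geq \mathcal{L}(Q),</math>
with equality exactly when <math>Q</math> coincides with the true posterior.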
The lower bound <math>\mathcal{L}(Q)</math> is known as the (negative) '''variational free energy''' in analogy with [[thermodynamic free energy]] because it can also be expressed as a negative energy <math>\operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})]</math> plus the [[Entropy (information theory)|entropy]] of <math>Q</math>.
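Written out, this is
:<math>\mathcal{L}(Q) = \operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})] - \operatorname{E}_{Q}[\log Q(\mathbf{Z})] = \operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})] + \mathcal{H}(Q),</math>
where <math>\mathcal{H}(Q) = -\operatorname{E}_{Q}[\log Q(\mathbf{Z})]</math> is the entropy of <math>Q</math>.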
=== Proofs ===
:<math>Q(\mathbf{Z}) = \prod_{i=1}^M q_i(\mathbf{Z}_i\mid \mathbf{X})</math>
It can be shown using the [[calculus of variations]] (hence the name "variational Bayes") that the "best" distribution <math>q_j^{*}</math> for each of the factors <math>q_j</math> (in terms of the distribution minimizing the KL divergence, as described above) satisfies:<ref name="Nguyen">{{cite web|last=Nguyen|first=Duy|title= AN IN DEPTH INTRODUCTION TO VARIATIONAL BAYES NOTE|date=15 August 2023 |ssrn=4541076 |url=https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4541076|access-date=15 August 2023}}</ref>
:<math>q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \frac{e^{\operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})]}}{\int e^{\operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})]}\, d\mathbf{Z}_j}</math>
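In practice, one usually works with logarithms, in which case the condition above becomes
:<math>\ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant},</math>
where the constant accounts for normalization and the expectation <math>\operatorname{E}_{q^*_{-j}}[\cdot]</math> is taken with respect to all factors <math>q_i^*</math> with <math>i \neq j</math>.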
Using the properties of expectations, the expression <math>\operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})]</math> can usually be simplified into a function of the fixed [[hyperparameter]]s of the [[prior distribution]]s over the latent variables and of expectations (and sometimes higher [[moment (mathematics)|moment]]s such as the [[variance]]) of latent variables not in the current partition (i.e. latent variables not included in <math>\mathbf{Z}_j</math>). This creates [[circular dependency|circular dependencies]] between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an [[iterative]] algorithm, much like EM (the [[expectation-maximization]] algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), the parameters of each distribution are then computed in turn using the current values of the expectations, and the expectations of each newly updated distribution are then recomputed from its new parameters. An algorithm of this sort is guaranteed to [[limit of a sequence|converge]].<ref>{{cite book|title=Convex Optimization|first1=Stephen P.|last1=Boyd|first2=Lieven|last2=Vandenberghe|year=2004|publisher=Cambridge University Press|isbn=978-0-521-83378-3|url=https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf|access-date=October 15, 2011}}</ref>
In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the normalizing constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the [[mean]]s); sometimes expectations of squared variables (which can be related to the [[variance]] of the variables), or expectations of higher powers (i.e. higher [[moment (mathematics)|moment]]s), also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations of other variables. The result is that the formulas for the parameters of each variable's distribution form a system of equations with mutual, [[Nonlinear system|nonlinear]] dependencies among the variables. Usually, it is not possible to solve this system directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
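As an illustration, consider the simplest non-trivial case (a sketch under assumed conjugate priors, not part of the derivation above): <math>N</math> observations drawn from a univariate Gaussian with unknown mean <math>\mu</math> and precision <math>\tau</math>, with priors <math>\mu\mid\tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})</math> and <math>\tau \sim \operatorname{Gamma}(a_0, b_0)</math>, and the factorization <math>q(\mu,\tau)=q(\mu)\,q(\tau)</math>. Applying the general formula gives a Gaussian <math>q^*(\mu)=\mathcal{N}(\mu_N,\lambda_N^{-1})</math> and a Gamma <math>q^*(\tau)=\operatorname{Gamma}(a_N,b_N)</math>; since <math>\mu_N</math> and <math>a_N</math> turn out to be fixed by the data, the iterative scheme reduces to a fixed-point iteration in <math>\operatorname{E}[\tau]</math>. The following Python sketch (all variable names and hyperparameter values are illustrative) implements that loop:
<syntaxhighlight lang="python">
import numpy as np

def vb_gaussian(x, mu0=0.0, lambda0=1.0, a0=1e-3, b0=1e-3,
                max_iter=100, tol=1e-10):
    """Mean-field VB for a univariate Gaussian with unknown mean and precision.

    Assumed (illustrative) priors: mu | tau ~ N(mu0, 1/(lambda0*tau)),
    tau ~ Gamma(a0, b0).  The factorization q(mu, tau) = q(mu) q(tau)
    yields q(mu) = N(mu_n, 1/lambda_n) and q(tau) = Gamma(a_n, b_n).
    """
    x = np.asarray(x, dtype=float)
    n, x_bar, x_sq_sum = len(x), x.mean(), np.sum(x ** 2)

    # These two parameters do not depend on the other factor, so they are
    # computed once; only E[tau] (through b_n) has to be iterated.
    mu_n = (lambda0 * mu0 + n * x_bar) / (lambda0 + n)   # mean of q(mu)
    a_n = a0 + (n + 1) / 2.0                             # shape of q(tau)

    e_tau = a0 / b0                                      # initial guess for E[tau]
    for _ in range(max_iter):
        lambda_n = (lambda0 + n) * e_tau                 # precision of q(mu)
        e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lambda_n   # E[mu], E[mu^2]

        # E_mu[ sum_i (x_i - mu)^2 + lambda0 * (mu - mu0)^2 ]
        quad = (x_sq_sum - 2.0 * n * x_bar * e_mu + n * e_mu2
                + lambda0 * (e_mu2 - 2.0 * mu0 * e_mu + mu0 ** 2))
        b_n = b0 + 0.5 * quad                            # rate of q(tau)

        e_tau_new = a_n / b_n                            # updated E[tau]
        converged = abs(e_tau_new - e_tau) <= tol * abs(e_tau)
        e_tau = e_tau_new
        if converged:
            break

    return {"mu_n": mu_n, "lambda_n": (lambda0 + n) * e_tau,
            "a_n": a_n, "b_n": a_n / e_tau}

# Quick check on synthetic data: true mean 2.0, true precision 1/0.5**2 = 4.
rng = np.random.default_rng(0)
print(vb_gaussian(rng.normal(loc=2.0, scale=0.5, size=500)))
</syntaxhighlight>
Here the posterior mean <math>\mu_N</math> and the shape <math>a_N</math> are determined directly by the data, so only <math>\operatorname{E}[\tau]</math> (equivalently the rate <math>b_N</math>) changes between iterations; this is the circular dependency described above, reduced to a single scalar fixed-point iteration.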
==A duality formula for variational inference==
==A more complex example==
[[File:bayesian-gaussian-mixture-vb.svg|right|300px|thumb|Bayesian Gaussian mixture model using [[plate notation]]. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size ''K''; [''D'',''D''] means a matrix of size ''D''×''D''; ''K'' alone means a [[categorical variable]] with ''K'' outcomes. The squiggly line coming from ''z'' ending in a crossbar indicates a ''switch'' — the value of this variable selects, for the other incoming variables, which value to use out of the size-''K'' array of possible values.]]
Imagine a Bayesian [[Gaussian mixture model]] described as follows:
Assume that <math>q(\mathbf{Z},\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda}) = q(\mathbf{Z})q(\mathbf{\pi},\mathbf{\mu},\mathbf{\Lambda})</math>.
Then the optimal factors <math>q^*</math> follow from the general result above.<ref name="Nguyen"/>