Variational Bayesian methods: Difference between revisions

 
===Proofs===
By the generalized [[Pythagorean theorem]] of [[Bregman divergence]], of which KL-divergence is a special case, it can be shown that:<ref name=Tran2018>{{cite arXiv|title=Copula Variational Bayes inference via information geometry|first1=Viet Hung|last1=Tran|year=2018|eprint=1803.10998|class=cs.IT}}</ref><ref name="Martin2014"/>
[[File:Bregman_divergence_Pythagorean.png|right|300px|thumb|Generalized Pythagorean theorem for [[Bregman divergence]]<ref name="Martin2014">{{cite journal |last1=Adamčík |first1=Martin |title=The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning |journal=Entropy |date=2014 |volume=16 |issue=12 |pages=6338–6381|bibcode=2014Entrp..16.6338A |doi=10.3390/e16126338 |doi-access=free }}</ref>]]
:<math>D_{\mathrm{KL}}(Q\parallel P) \geq D_{\mathrm{KL}}(Q\parallel Q^{*}) + D_{\mathrm{KL}}(Q^{*}\parallel P), \qquad \forall Q\in\mathcal{C},</math>
 
where <math>\mathcal{C}</math> is a [[convex set]] and the equality holds if:
 
:<math> Q = Q^{*} \triangleq \arg\min_{Q\in\mathcal{C}}D_{\mathrm{KL}}(Q\parallel P). </math>
In this case, the global minimizer <math>Q^{*}(\mathbf{Z}) = q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)q^{*}(\mathbf{Z}_2) = q^{*}(\mathbf{Z}_2\mid\mathbf{Z}_1)q^{*}(\mathbf{Z}_1),</math> with <math>\mathbf{Z}=\{\mathbf{Z}_1,\mathbf{Z}_2\},</math> can be found as follows:<ref name=Tran2018/>
 
:<math> q^{*}(\mathbf{Z}_2)
\begin{array}{rl}
&= \frac{P(\mathbf{X})}{\zeta(\mathbf{X})}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))} \\
&= \frac{1}{\zeta(\mathbf{X})}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right),
\end{array}</math>
 
in which the normalizing constant is:
 
:<math>\zeta(\mathbf{X})
\begin{array}{rl}
&= P(\mathbf{X})\int_{\mathbf{Z}_2}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})))} \\
&= \int_{\mathbf{Z}_{2}}\exp\mathbb{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right).
\end{array}</math>
 
The term <math>\zeta(\mathbf{X})</math> is often called the [[model evidence|evidence]] lower bound ('''ELBO''') in practice, since <math>P(\mathbf{X})\geq\zeta(\mathbf{X})=\exp(\mathcal{L}(Q^{*}))</math>,<ref name=Tran2018/> as shown above.
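
These identities can be checked numerically on a small discrete model. The following sketch (a hypothetical two-variable example, not part of the cited derivation) takes <math>q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)</math> to be the exact conditional <math>P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})</math>, so the KL term in the numerator vanishes; the formula then returns <math>q^{*}(\mathbf{Z}_2)=P(\mathbf{Z}_2\mid\mathbf{X})</math> and <math>\zeta(\mathbf{X})=P(\mathbf{X})</math>, while any other factorized <math>Q</math> only gives <math>\exp(\mathcal{L}(Q))\leq P(\mathbf{X})</math>:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint table P(Z1, Z2, X = x_obs) for binary Z1, Z2 and one observed X.
joint = np.array([[0.02, 0.05],
                  [0.06, 0.07]])              # joint[z1, z2] = P(Z1 = z1, Z2 = z2, X)
p_x = joint.sum()                             # evidence P(X)
post_z2 = joint.sum(axis=0) / p_x             # exact posterior P(Z2 | X)

# Take q*(Z1 | Z2) to be the exact conditional P(Z1 | Z2, X).
q_z1_given_z2 = joint / joint.sum(axis=0, keepdims=True)

# q*(Z2) is proportional to exp E_{q*(Z1|Z2)}[ log P(Z, X) / q*(Z1 | Z2) ].
log_ratio = np.log(joint) - np.log(q_z1_given_z2)
unnorm = np.exp((q_z1_given_z2 * log_ratio).sum(axis=0))
zeta = unnorm.sum()                           # normalizing constant zeta(X)
q_z2 = unnorm / zeta

print(np.allclose(q_z2, post_z2))             # True: q*(Z2) recovers P(Z2 | X)
print(np.isclose(zeta, p_x))                  # True: zeta(X) = P(X) when the conditional is exact

# With an arbitrary (non-optimal) factor for Z1 the bound is only an inequality.
q = np.outer([0.5, 0.5], q_z2)                # fully factorized q(Z1) q(Z2)
elbo = (q * (np.log(joint) - np.log(q))).sum()
print(np.exp(elbo) <= p_x)                    # True: exp(ELBO) <= P(X)
</syntaxhighlight>
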
Using the properties of expectations, the expression <math>\operatorname{E}_{q^*_{-j}} [\ln p(\mathbf{Z}, \mathbf{X})]</math> can usually be simplified into a function of the fixed [[Hyperparameter (Bayesian statistics)|hyperparameter]]s of the [[prior distribution]]s over the latent variables and of expectations (and sometimes higher [[moment (mathematics)|moment]]s such as the [[variance]]) of latent variables not in the current partition (i.e. latent variables not included in <math>\mathbf{Z}_j</math>). This creates [[circular dependency|circular dependencies]] between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an [[iterative]] algorithm, much like EM (the [[expectation–maximization algorithm]]), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to [[limit of a sequence|converge]].<ref>{{cite book|title=Convex Optimization|first1=Stephen P.|last1=Boyd|first2=Lieven|last2=Vandenberghe|year=2004|publisher=Cambridge University Press|isbn=978-0-521-83378-3|url=https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf|access-date=October 15, 2011}}</ref>
 
In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's [[functional dependency]] on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the [[mean]]s); sometimes expectations of squared variables (which can be related to the [[variance]] of the variables), or expectations of higher powers (i.e. higher [[moment (mathematics)|moment]]s) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, [[nonlinear]] dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
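
As an illustrative sketch of this iterative scheme, the following code applies the standard mean-field updates for the Gaussian mean–precision model used in the basic example below, with assumed data and hyperparameter values; each pass updates one factor using only the current moments of the other:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)          # synthetic observations (assumed)
N, x_sum, x_sq_sum = len(x), x.sum(), (x ** 2).sum()

# Prior hyperparameters for mu ~ N(mu_0, (lambda_0 tau)^-1), tau ~ Gamma(a_0, b_0) (assumed values).
mu_0, lambda_0, a_0, b_0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                                           # initialize the required moment of q(tau)
for _ in range(100):
    # Update q(mu) = N(mu_N, 1/lambda_N); it depends on q(tau) only through E[tau].
    mu_N = (lambda_0 * mu_0 + x_sum) / (lambda_0 + N)
    lambda_N = (lambda_0 + N) * E_tau
    E_mu, E_mu2 = mu_N, 1.0 / lambda_N + mu_N ** 2    # first and second moments of q(mu)

    # Update q(tau) = Gamma(a_N, b_N); it depends on q(mu) only through E[mu], E[mu^2].
    a_N = a_0 + (N + 1) / 2
    b_N = b_0 + 0.5 * (x_sq_sum - 2 * E_mu * x_sum + N * E_mu2
                       + lambda_0 * (E_mu2 - 2 * mu_0 * E_mu + mu_0 ** 2))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))                     # approximate posterior mean and noise scale
</syntaxhighlight>

Only low-order moments of each factor are passed between the two updates, which is exactly the circular dependence described above.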
 
==A duality formula for variational inference==
For probability measures <math>P</math> and <math>Q</math> with <math>Q \ll P</math> and a measurable real-valued function <math>h</math> such that <math>E_P[\exp h]</math> is finite, the following duality formula holds:
:<math> \log E_P[\exp h] = \sup_{Q \ll P} \{ E_Q[h] - D_\text{KL}(Q \parallel P)\}.</math>
 
Further, the supremum on the right-hand side is attained [[if and only if]]

:<math> \frac{q(\theta)}{p(\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]},</math>

where <math>q</math> and <math>p</math> denote the densities of <math>Q</math> and <math>P</math>, respectively.
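
This duality can be verified numerically on a finite space. The sketch below uses an arbitrary reference distribution and function chosen purely for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))                 # reference distribution P on a 5-point space
h = rng.normal(size=5)                        # a real-valued function h on the same space

lhs = np.log(np.sum(p * np.exp(h)))           # log E_P[exp h]

def objective(q):                             # E_Q[h] - D_KL(Q || P)
    return np.sum(q * h) - np.sum(q * np.log(q / p))

q_star = p * np.exp(h) / np.sum(p * np.exp(h))    # the maximizer given above
print(np.isclose(objective(q_star), lhs))         # True: the supremum is attained

for _ in range(5):                            # any other Q stays below the left-hand side
    q = rng.dirichlet(np.ones(5))
    assert objective(q) <= lhs + 1e-12
</syntaxhighlight>
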
 
In the above derivation of <math>\ln q_\mu^*(\mu)</math>, <math>C</math>, <math>C_2</math> and <math>C_3</math> refer to values that are constant with respect to <math>\mu</math>. Note that the term <math>\operatorname{E}_{\tau}[\ln p(\tau)]</math> is not a function of <math>\mu</math> and will have the same value regardless of the value of <math>\mu</math>. Hence in line 3 we can absorb it into the [[constant term]] at the end. We do the same thing in line 7.
 
The last line is simply a quadratic polynomial in <math>\mu</math>. Since this is the logarithm of <math>q_\mu^*(\mu)</math>, we can see that <math>q_\mu^*(\mu)</math> itself is a [[Gaussian distribution]].
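
Concretely, writing the quadratic in the generic form below (the coefficients <math>A>0</math> and <math>B</math> stand for the model-dependent constants collected in the derivation), completing the square identifies the Gaussian parameters:

:<math>\ln q_\mu^*(\mu) = -\frac{A}{2}\mu^2 + B\mu + \text{const} \quad\Longrightarrow\quad q_\mu^*(\mu) \propto \exp\left(-\frac{A}{2}\left(\mu - \frac{B}{A}\right)^2\right),</math>

so <math>q_\mu^*(\mu)</math> is a Gaussian with mean <math>B/A</math> and precision <math>A</math>.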
 
==External links==
* [https://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms] {{Webarchive|url=https://web.archive.org/web/20170512025952/https://www.inference.phy.cam.ac.uk/mackay/itila/ |date=2017-05-12 }}, by [[David J.C. MacKay]] provides an introduction to variational methods (p.&nbsp;422).
* [https://www.robots.ox.ac.uk/~sjrob/Pubs/fox_vbtut.pdf A Tutorial on Variational Bayes]. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review, {{doi|10.1007/s10462-011-9236-8}}.
* [https://www.gatsby.ucl.ac.uk/vbayes/ Variational-Bayes Repository] A repository of research papers, software, and links related to the use of variational methods for approximate Bayesian learning up to 2003.