 
Now define the function<math display="block">L_{\theta,\phi}(x) := \mathbb E_{\mathbf{z} \sim q_\phi(\cdot \mid x)} \left[\log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right] = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) = E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z}))) - D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) </math>This is named the [[evidence lower bound]] (ELBO). Maximizing the ELBO<math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>is equivalent to simultaneously maximizing <math>p_\theta(x) </math> and minimizing <math> D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) </math>. That is, maximizing the log-likelihood of the observed data and minimizing the divergence of the approximate posterior <math>q_\phi(\cdot \mid x) </math> from the exact posterior <math>p_\theta(\cdot \mid x) </math>.
 
For a more detailed derivation and interpretation of the ELBO and its maximization, see [[Evidence lower bound|its main page]].
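For the common choice of a standard normal prior <math>p_\theta(\mathbf{z}) = \mathcal{N}(0, I)</math> and a diagonal Gaussian encoder <math>q_\phi(\mathbf{z\mid x}) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))</math> (the usual VAE setup, assumed here for illustration), the KL term in the decomposition above has the closed form

: <math>D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) = \frac{1}{2}\sum_{j=1}^{d} \left(\sigma_j^2 + \mu_j^2 - 1 - \ln \sigma_j^2\right), </math>

so only the expected reconstruction term <math>E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z})))</math> needs to be estimated by sampling.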
Since the Kullback–Leibler divergence is non-negative, the ELBO is indeed a lower bound on the log-evidence:

: <math>L_{\theta,\phi}(x) = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) \leq \log (p_\theta(\mathbf{x})) </math>

which is what gives it its name. In practice, one minimizes the negative ELBO <math>-L_{\theta,\phi}(x)</math> as a loss function, as is common in optimization. The main advantage of this formulation is that it can be optimized jointly with respect to the parameters <math>\theta </math> and <math>\phi </math>.
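Putting the pieces together, the following is a minimal PyTorch sketch of the negative ELBO as a training loss, assuming the diagonal Gaussian encoder and standard normal prior above and a Bernoulli decoder whose outputs are probabilities in <math>[0,1]</math> (the function and argument names are illustrative, not from the references):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def negative_elbo(x, recon_x, mu, log_var):
    """Negative ELBO for q_phi(z|x) = N(mu, diag(exp(log_var))), p(z) = N(0, I).

    The reconstruction term -E_q[log p_theta(x|z)] is estimated with a single
    Monte Carlo sample (recon_x was decoded from one sampled z); the KL term
    uses the closed form for diagonal Gaussians given above.
    """
    # -log p_theta(x|z) for a Bernoulli decoder is the binary cross-entropy
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form D_KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl
</syntaxhighlight>

Minimizing this loss by gradient descent on <math>\theta</math> and <math>\phi</math> together implements the joint optimization described above.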
 
Before the ELBO loss can be used with backpropagation, the stochastic sampling step must be removed from the gradient path by the so-called '''reparameterization trick''', which makes the objective differentiable with respect to the encoder parameters.
 
== Reparameterization ==
[[File:Reparameterization Trick.png|thumb|300x300px|The scheme of the reparameterization trick. The random variable <math>\mathbf{\varepsilon}</math> is injected into the latent space <math>\mathbf{z}</math> as an external input, so the gradient can be backpropagated through the network parameters without passing through a stochastic node.]]{{Main|Reparametrization trick}}
To efficiently search for <math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x), </math>the typical method is [[gradient descent]] on the negative ELBO. However, the direct approach<math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right] </math>does not allow one to move <math>\nabla_\phi </math> inside the expectation, since <math>\phi </math> appears in the probability distribution itself. The '''reparameterization trick''' bypasses this difficulty.
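Concretely, when <math>q_\phi(\mathbf{z\mid x})</math> is a diagonal Gaussian <math>\mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))</math>, the sample can be rewritten as <math>z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon</math> with <math>\varepsilon \sim \mathcal{N}(0, I)</math>, so that the randomness enters as an external input and the gradient flows through <math>\mu_\phi</math> and <math>\sigma_\phi</math>. A minimal PyTorch sketch of this step (names illustrative):

<syntaxhighlight lang="python">
import torch

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps.

    eps carries all the randomness and does not depend on the encoder
    parameters, so gradients with respect to mu and log_var (and hence
    with respect to phi) pass through the sampling step unobstructed.
    """
    sigma = torch.exp(0.5 * log_var)   # log-variance -> standard deviation
    eps = torch.randn_like(sigma)      # external standard normal noise
    return mu + sigma * eps
</syntaxhighlight>

This is the same mechanism exposed by <code>rsample()</code> in <code>torch.distributions</code>, which produces reparameterized samples that support backpropagation.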
 
To make the ELBO formulation suitable for training purposes, it is necessary to slightly modify the problem formulation and the VAE structure.<ref name=":0" /><ref>{{Cite journal|last1=Bengio|first1=Yoshua|last2=Courville|first2=Aaron|last3=Vincent|first3=Pascal|title=Representation Learning: A Review and New Perspectives|url=https://ieeexplore.ieee.org/abstract/document/6472238|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2013|volume=35|issue=8|pages=1798–1828|doi=10.1109/TPAMI.2013.50|pmid=23787338|issn=1939-3539|arxiv=1206.5538|s2cid=393948}}</ref><ref>{{Cite arXiv|last1=Kingma|first1=Diederik P.|last2=Rezende|first2=Danilo J.|last3=Mohamed|first3=Shakir|last4=Welling|first4=Max|date=2014-10-31|title=Semi-Supervised Learning with Deep Generative Models|class=cs.LG|eprint=1406.5298}}</ref>