Variational autoencoder

Unfortunately, the computation of <math>p_\theta(\mathbf{x})</math> is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce an additional function that approximates the posterior distribution as
 
:<math>q_\phi(\mathbf{z\mid x}) \approx p_\theta(\mathbf{z\mid x})</math>
 
with <math>\phi</math> defined as the set of real values that parametrize <math>q</math>.
 
In this way, the overall problem translates naturally into the autoencoder ___domain, in which the conditional likelihood distribution <math>p_\theta(\mathbf{x}\mid\mathbf{z})</math> is carried by the ''probabilistic decoder'', while the approximated posterior distribution <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> is computed by the ''probabilistic encoder''.
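As an illustration, the probabilistic encoder and decoder can each be realised as a small neural network. The following sketch (a non-authoritative example assuming [[PyTorch]], a Gaussian encoder, and illustrative layer sizes) shows an encoder that outputs the parameters of <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> and a decoder that outputs the parameters of <math>p_\theta(\mathbf{x}\mid\mathbf{z})</math>:

<syntaxhighlight lang="python">
import torch
from torch import nn

class Encoder(nn.Module):
    """Probabilistic encoder: maps x to the parameters of q_phi(z|x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log of the variance of q_phi(z|x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Probabilistic decoder: maps z to the parameters of p_theta(x|z)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        # For binary data the output is interpreted as the mean of a
        # Bernoulli distribution over each component of x.
        return torch.sigmoid(self.out(h))
</syntaxhighlight>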
 
== ELBO loss function ==
As in every [[deep learning]] problem, it is necessary to define a differentiable loss function in order to update the network weights through [[backpropagation]].
 
For variational autoencoders the idea is to jointly optimize the generative model parameters <math>\theta</math> to reduce the reconstruction error between the input and the output, and <math>\phi</math> to make <math>q_\phi(\mathbf{z\mid x})</math> as close as possible to <math>p_\theta(\mathbf{z}\mid\mathbf{x})</math>.
 
As reconstruction loss, [[mean squared error]] and [[cross entropy]] are often used.
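For example (a minimal sketch, assuming PyTorch tensors with illustrative shapes, where <code>x_hat</code> denotes the decoder output for an input batch <code>x</code>):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

x = torch.rand(16, 784)      # a batch of inputs with values in [0, 1] (illustrative)
x_hat = torch.rand(16, 784)  # the corresponding decoder outputs in [0, 1]

# Mean squared error, suited to real-valued data.
recon_mse = F.mse_loss(x_hat, x, reduction='sum')

# Binary cross-entropy, suited to data in [0, 1] modelled as Bernoulli means.
recon_bce = F.binary_cross_entropy(x_hat, x, reduction='sum')
</syntaxhighlight>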
 
As distance loss between the two distributions, the reverse Kullback–Leibler divergence <math>D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x}))</math> is a good choice, since minimizing it squeezes <math>q_\phi(\mathbf{z\mid x})</math> under <math>p_\theta(\mathbf{z}\mid\mathbf{x})</math>.<ref name=":0" /><ref>{{cite web |title=From Autoencoder to Beta-VAE |url=https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html |website=Lil'Log |language=en |date=2018-08-12}}</ref>
 
The distance loss just defined is expanded as
 
: <math>\begin{align}
D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) &= \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z\mid x})} \, d\mathbf{z}\\
&= \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})p_\theta(\mathbf{x})}{p_\theta(\mathbf{z,x})} \,d\mathbf{z}\\
&= \int q_\phi(\mathbf{z\mid x}) \left( \log (p_\theta(\mathbf{x})) + \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z,x})}\right) d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z,x})} \,d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{x\mid z})p_\theta(\mathbf{z})} \,d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}\left(\log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z})} - \log(p_\theta(\mathbf{x\mid z}))\right)\\
&= \log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) - E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z})))
\end{align}</math>
 
At this point, it is possible to rewrite the equation as
 
: <math>\log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) = E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z}))) - D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z}))</math>
 
The goal is to maximize the [[Left hand side|LHS]] of this equation: maximizing the [[log-likelihood]] <math>\log p_\theta(\mathbf{x})</math> improves the quality of the generated data, while minimizing the Kullback–Leibler divergence reduces the distance between the true posterior and its approximation.
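This identity can be checked numerically in a toy model in which every term has a closed form. The sketch below (an illustrative check only; the one-dimensional linear-Gaussian model, the chosen values, and the helper names are assumptions, not part of the standard derivation) evaluates both sides of the equation above and confirms that they agree:

<syntaxhighlight lang="python">
import math

def kl_gauss(m1, s1, m2, s2):
    """KL divergence between the univariate Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def log_normal(x, m, s):
    """Log density of N(m, s^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * s**2) - (x - m)**2 / (2 * s**2)

# Toy model: p(z) = N(0, 1) and p(x|z) = N(z, 1),
# hence p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
x = 1.3                 # an arbitrary observation
m_q, s_q = 0.4, 0.8     # an arbitrary approximate posterior q(z|x) = N(m_q, s_q^2)

# Left-hand side: log p(x) - KL(q(z|x) || p(z|x)).
lhs = log_normal(x, 0.0, math.sqrt(2.0)) - kl_gauss(m_q, s_q, x / 2, math.sqrt(0.5))

# Right-hand side: E_q[log p(x|z)] - KL(q(z|x) || p(z)),
# where E_q[log p(x|z)] = -0.5*log(2*pi) - ((x - m_q)^2 + s_q^2)/2 in closed form.
expected_loglik = -0.5 * math.log(2 * math.pi) - ((x - m_q)**2 + s_q**2) / 2
rhs = expected_loglik - kl_gauss(m_q, s_q, 0.0, 1.0)

print(lhs, rhs)  # the two sides coincide up to floating-point error
</syntaxhighlight>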
The loss function so obtained, also named the [[evidence lower bound]] loss function, or ELBO for short, can be written as
 
: <math>L_{\theta,\phi} = -\log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) = -E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z}))) + D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) </math>
 
Given the non-negativity of the Kullback–Leibler divergence, it follows that
 
: <math>-L_{\theta,\phi} = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) \leq \log (p_\theta(\mathbf{x})) </math>
 
The optimal parameters minimize this loss function. The problem can be summarized as
 
: <math>\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{arg\,min}} \, L_{\theta,\phi} </math>
 
The main advantage of this formulation is that it allows joint optimization with respect to the parameters <math>\theta </math> and <math>\phi </math>.
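A minimal sketch of this joint optimization is given below (assuming the <code>Encoder</code> and <code>Decoder</code> modules from the earlier sketch, binary data, a standard normal prior, and the Gaussian latent model with the reparameterization trick described below; all names and shapes are illustrative):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

enc, dec = Encoder(), Decoder()   # modules from the earlier sketch
# A single optimizer updates both parameter sets theta (decoder) and phi (encoder).
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def training_step(x):
    mu, log_var = enc(x)                                       # parameters of q_phi(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample (see below)
    x_hat = dec(z)

    # Negative ELBO = reconstruction term + KL(q_phi(z|x) || p(z)),
    # with the KL term in closed form for a Gaussian posterior and N(0, I) prior.
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    loss = recon + kl

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x = torch.rand(16, 784)   # a batch of illustrative inputs in [0, 1]
print(training_step(x))
</syntaxhighlight>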
 
Before the ELBO loss function can be used in an optimization problem with backpropagation of the gradient, it must be made differentiable. This is achieved with the so-called '''reparameterization trick''', which removes the stochastic sampling from the formulation.
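The issue can be illustrated with PyTorch's distribution interface (an illustrative sketch, assuming the Gaussian latent model introduced below): a plain sample is detached from the computational graph, whereas a reparameterized sample remains a differentiable function of the distribution parameters.

<syntaxhighlight lang="python">
import torch
from torch.distributions import Normal

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)
q = Normal(mu, torch.exp(0.5 * log_var))

z_plain = q.sample()     # drawn without gradient tracking: no gradient w.r.t. mu or log_var
z_reparam = q.rsample()  # z = mu + sigma * eps with eps ~ N(0, I): differentiable

z_reparam.sum().backward()
print(mu.grad, log_var.grad)  # gradients flow back through the reparameterized sample
</syntaxhighlight>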
The main assumption about the latent space is that it can be considered to be a set of multivariate Gaussian distributions, and thus can be described as
 
: <math>\mathbf{z} \sim q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)</math>.[[File:Reparameterized Variational Autoencoder.png|thumb|The scheme of a variational autoencoder after the reparameterization trick.|300x300px]]
 
Given <math>\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \boldsymbol{I})</math> and <math>\odot</math> defined as the element-wise product ([[Hadamard product (matrices)|Hadamard product]]), the reparameterization trick modifies the above equation as