== Reparameterization ==
[[File:Reparameterization Trick.png|thumb|300x300px|The scheme of the reparameterization trick. The random variable <math>\mathbf{\varepsilon}</math> is injected into the latent space <math>\mathbf{z}</math> as an external input. In this way, it is possible to backpropagate the gradient without involving the stochastic variable during the update.]]{{Main|Reparametrization trick}}
To efficiently search for <math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>the typical method is [[gradient descent]].
 
It is straightforward to find<math display="block">\nabla_\theta \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right]
= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[ \nabla_\theta \log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right] </math>However, <math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right] </math>does not allow one to put the <math>\nabla_\phi </math> inside the expectation, since <math>\phi </math> appears in the probability distribution itself. The '''reparameterization trick''' (also known as stochastic backpropagation<ref>{{Cite journal |last1=Rezende |first1=Danilo Jimenez |last2=Mohamed |first2=Shakir |last3=Wierstra |first3=Daan |date=2014-06-18 |title=Stochastic Backpropagation and Approximate Inference in Deep Generative Models |url=https://proceedings.mlr.press/v32/rezende14.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=1278–1286}}</ref>) bypasses this difficulty.<ref name=":0" /><ref>{{Cite journal|last1=Bengio|first1=Yoshua|last2=Courville|first2=Aaron|last3=Vincent|first3=Pascal|title=Representation Learning: A Review and New Perspectives|url=https://ieeexplore.ieee.org/abstract/document/6472238|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2013|volume=35|issue=8|pages=1798–1828|doi=10.1109/TPAMI.2013.50|pmid=23787338|issn=1939-3539|arxiv=1206.5538|s2cid=393948}}</ref><ref>{{Cite arXiv|last1=Kingma|first1=Diederik P.|last2=Rezende|first2=Danilo J.|last3=Mohamed|first3=Shakir|last4=Welling|first4=Max|date=2014-10-31|title=Semi-Supervised Learning with Deep Generative Models|class=cs.LG|eprint=1406.5298}}</ref>
 
Stochastic sampling is the non-differentiable operation that draws a sample from the latent space and feeds it to the probabilistic decoder.
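
The effect of this non-differentiability can be illustrated with automatic-differentiation software. The following minimal sketch uses [[PyTorch]] (the variable names are illustrative assumptions, not fixed notation): a draw obtained with <code>sample()</code> is detached from the parameters that define the distribution, whereas the reparameterized <code>rsample()</code> keeps the sample differentiable with respect to them.

<syntaxhighlight lang="python">
# Minimal sketch with assumed names; requires PyTorch.
import torch
from torch.distributions import Normal

phi = torch.tensor([0.0, 0.0], requires_grad=True)   # stands in for encoder parameters
mu, log_sigma = phi[0], phi[1]

# Direct ancestral sampling: the draw is treated as a constant,
# so no gradient reaches the parameters that defined the distribution.
z = Normal(mu, log_sigma.exp()).sample()
print(z.requires_grad)        # False: the stochastic node blocks backpropagation

# Reparameterized sampling: z = mu + sigma * eps with eps ~ N(0, 1),
# so the sample is a differentiable function of mu and sigma.
z = Normal(mu, log_sigma.exp()).rsample()
z.pow(2).backward()           # gradients now flow through the sampling step
print(phi.grad)
</syntaxhighlight>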
 
The most important example is when <math>z \sim q_\phi(\cdot | x) </math> is normally distributed, as <math>\mathcal N(\mu_\phi(x), \Sigma_\phi(x)) </math>. That is, the approximate posterior over the latent space is modeled as a multivariate Gaussian distribution,

: <math>\mathbf{z} \sim q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\Sigma}_\phi(\mathbf{x}))</math>.[[File:Reparameterized Variational Autoencoder.png|thumb|The scheme of a variational autoencoder after the reparameterization trick. |300x300px]]
 
This can be reparametrized by letting <math>\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \boldsymbol{I})</math> be a "standard [[Random number generation|random number generator]]", and constructing <math>z </math> as <math>z = \mu_\phi(x) + L_\phi(x)\epsilon </math>. Here, <math>L_\phi(x) </math> is obtained by the [[Cholesky decomposition]]:<math display="block">\Sigma_\phi(x) = L_\phi(x)L_\phi(x)^T </math>Then we have<math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\log \frac{p_\theta(\mathbf{z,x})}{q_\phi(\mathbf{z\mid x})}\right]
=
\mathbb {E}_{\epsilon}\left[ \nabla_\phi \log {\frac {p_{\theta }(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_{\phi }(\mu_\phi(x) + L_\phi(x)\epsilon | x)}}\right] </math>and so we obtain an unbiased estimator of the gradient, allowing [[stochastic gradient descent]].

When the covariance <math>\Sigma_\phi(x)</math> is diagonal with standard deviations <math>\boldsymbol{\sigma}</math>, and with <math>\odot</math> denoting the element-wise product ([[Hadamard product (matrices)|Hadamard product]]), the reparameterization reduces to

: <math>\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}. </math>
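
As a concrete illustration of the reparameterized sample, a short sketch (assuming the encoder has already produced <math>\mu</math> together with either a full covariance <math>\Sigma</math> or a diagonal standard deviation <math>\sigma</math>; the function names are illustrative):

<syntaxhighlight lang="python">
# Sketch of the reparameterization step; requires PyTorch.
import torch

def reparameterize_full(mu, Sigma):
    """General case: z = mu + L @ eps, with Sigma = L L^T from a Cholesky factorization."""
    L = torch.linalg.cholesky(Sigma)
    eps = torch.randn_like(mu)        # external randomness, eps ~ N(0, I)
    return mu + L @ eps

def reparameterize_diag(mu, sigma):
    """Diagonal case: z = mu + sigma ⊙ eps (element-wise product)."""
    eps = torch.randn_like(mu)
    return mu + sigma * eps
</syntaxhighlight>

Because <math>\boldsymbol{\varepsilon}</math> is drawn independently of <math>\phi</math>, either expression is a differentiable function of the encoder parameters and can be backpropagated through.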
 
Thanks to this transformation (which can be extended to non-Gaussian distributions), the VAE becomes trainable: the probabilistic encoder learns to map a compressed representation of the input into the two latent vectors <math>\boldsymbol{\mu} </math> and <math>\boldsymbol{\sigma} </math>, while the stochasticity is excluded from the parameter updates and is injected into the latent space as an external input through the random vector <math>\boldsymbol{\varepsilon} </math>.
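
Putting the pieces together, a single training step might look as follows (a minimal sketch in PyTorch; the network sizes, layer choices and loss terms are illustrative assumptions, not a reference implementation):

<syntaxhighlight lang="python">
# Minimal sketch of one VAE training step with the reparameterization trick; requires PyTorch.
import torch
from torch import nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.hidden = nn.Linear(x_dim, 256)
        self.mu = nn.Linear(256, z_dim)        # latent mean
        self.log_var = nn.Linear(256, z_dim)   # latent log-variance

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

encoder = Encoder()
decoder = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 784))
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)                        # stand-in mini-batch
mu, log_var = encoder(x)
eps = torch.randn_like(mu)                     # randomness injected as external input
z = mu + torch.exp(0.5 * log_var) * eps        # z = mu + sigma ⊙ eps

x_hat = decoder(z)
reconstruction = F.mse_loss(x_hat, x, reduction='sum')
# Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior.
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
loss = reconstruction + kl    # negative ELBO, with MSE standing in for the log-likelihood term

optimizer.zero_grad()
loss.backward()               # gradients reach both decoder and encoder parameters
optimizer.step()
</syntaxhighlight>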
 
== Variations ==