Variational autoencoder: Difference between revisions

Revision as of 09:09, 18 June 2021 edit EugenioTL (talk \| contribs) 57 edits Wikipedia:GLAM/PoliMi/2021 Tags: Removed redirect Reverted Visual edit: Switched ← Previous edit		Revision as of 21:16, 2 August 2025 edit undo TokenByToken (talk \| contribs) Extended confirmed users 1,392 edits category Tag: Visual edit Next edit →
(184 intermediate revisions by 82 users not shown)
Line 1: {{short description\|Deep learning generative model to encode data representation}} {{Use dmy dates\|date=June 2021\|cs1-dates=y}} [[File:VAE Basic.png\|thumb\|425x425px\|The basic scheme of a variational autoencoder. The model receives <math>x</math> as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces <math>{x'}</math> as similar as possible to <math>x</math>.]] {{Machine learning bar}} ~~<!-- EDIT BELOW THIS LINE -->~~ In [[machine learning]], a '''variational autoencoder''',<ref name=":0">{{cite book \|last1=Kingma \|first1=Diederik P. \|last2=Welling \|first2=Max \|title=Auto-Encoding Variational Bayes \|date=2014-05-01 \|url=https://arxiv.org/abs/1312.6114}}</ref> also known as '''VAE''', is the [[artificial neural network]] architecture introduced by [[Diederik P Kingma]] and [[Max Welling]], belonging to the families of [[graphical model\|probabilistic graphical model]]s and [[Variational Bayesian methods\|variational bayesian methods]]. In [[machine learning]], a '''variational autoencoder''' ('''VAE''') is an [[artificial neural network]] architecture introduced by Diederik P. Kingma and [[Max Welling]].<ref>{{cite arXiv \|last1=Kingma \|first1=Diederik P. 
\|title=Auto-Encoding Variational Bayes \|date=2022-12-10 \|last2=Welling \|first2=Max\|class=stat.ML \|eprint=1312.6114 }}</ref> It is part of the families of [[graphical model\|probabilistic graphical models]] and [[variational Bayesian methods]].<ref>{{cite book \|first1=Lucas \|last1=Pinheiro Cinelli \|first2=Matheus \|last2=Araújo Marins \|first3=Eduardo Antônio \|last3=Barros da Silva \|first4=Sérgio \|last4=Lima Netto \|display-authors=1 \|title=Variational Methods for Machine Learning with Applications to Deep Networks \|___location= \|publisher=Springer \|year=2021 \|pages=111–149 \|chapter=Variational Autoencoder \|isbn=978-3-030-70681-4 \|chapter-url=https://books.google.com/books?id=N5EtEAAAQBAJ&pg=PA111 \|doi=10.1007/978-3-030-70679-1_5 \|s2cid=240802776 }}</ref> It is often associated with the [[autoencoder]]<ref>{{cite book \|last1=Kramer \|first1=Mark A. \|title=Nonlinear principal component analysis using autoassociative neural networks \|date=1991 \|pages=233–243 \|url=https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.690370209 \|language=en}}</ref><ref>{{cite book \|last1=Hinton \|first1=G. E. \|last2=Salakhutdinov \|first2=R. R. \|title=Reducing the Dimensionality of Data with Neural Networks \|date=2006-07-28 \|pages=504–507 \|url=https://science.sciencemag.org/content/313/5786/504.abstract?casa_token=ZLsQ9vPfFA4AAAAA:3iBJRtRFr9RzkbbGpAJQtghIAndmRGEPVxW-yixDgfiXqWuuaQs8WjDMf-fkzTIe8RKn_J9o1aFozD4 \|language=en}}</ref> model because of its architectural affinity, but there are significant differences both in the goal and in the mathematical formulation. Variational autoencoders are meant to compress the input information into a constrained multivariate latent distribution ([[Code\|encoding]]) to reconstruct it as accurately as possible ([[Code\|decoding]]). Although this type of model was initially designed for [[unsupervised learning]]<ref>{{cite book \|last1=Dilokthanakul \|first1=Nat \|last2=Mediano \|first2=Pedro A. M. 
\|last3=Garnelo \|first3=Marta \|last4=Lee \|first4=Matthew C. H. \|last5=Salimbeni \|first5=Hugh \|last6=Arulkumaran \|first6=Kai \|last7=Shanahan \|first7=Murray \|title=Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders \|date=2017-01-13 \|url=https://arxiv.org/abs/1611.02648}}</ref><ref>{{cite book \|last1=Hsu \|first1=Wei-Ning \|last2=Zhang \|first2=Yu \|last3=Glass \|first3=James \|title=Unsupervised ___domain adaptation for robust speech recognition via variational autoencoder-based data augmentation \|date=December 2017 \|pages=16–23 \|url=https://ieeexplore.ieee.org/abstract/document/8268911?casa_token=i8S9DzueB5gAAAAA:SnZUh5mfUYtRpusQLMJxN7eC_-6-qOQs9vpkEcA0Ai_ju-nJH7o1H1DN6nDFdeCY-LgGg3OVKQ}}</ref>, its effectiveness has been proven in other domains of machine learning such as [[semi-supervised learning]]<ref>{{cite book \|last1=Ehsan Abbasnejad \|first1=M. \|last2=Dick \|first2=Anthony \|last3=van den Hengel \|first3=Anton \|title=Infinite Variational Autoencoder for Semi-Supervised Learning \|date=2017 \|pages=5888–5897 \|url=https://openaccess.thecvf.com/content_cvpr_2017/html/Abbasnejad_Infinite_Variational_Autoencoder_CVPR_2017_paper.html}}</ref><ref>{{cite book \|last1=Xu \|first1=Weidi \|last2=Sun \|first2=Haoze \|last3=Deng \|first3=Chao \|last4=Tan \|first4=Ying \|title=Variational Autoencoder for Semi-Supervised Text Classification \|date=2017-02-12 \|url=https://ojs.aaai.org/index.php/AAAI/article/view/10966 \|language=en}}</ref> or [[supervised learning]]<ref>{{cite book \|last1=Kameoka \|first1=Hirokazu \|last2=Li \|first2=Li \|last3=Inoue \|first3=Shota \|last4=Makino \|first4=Shoji \|title=Supervised Determined Source Separation with Multichannel Variational Autoencoder \|date=2019-09-01 \|pages=1891–1914 \|url=https://direct.mit.edu/neco/article/31/9/1891/8494/Supervised-Determined-Source-Separation-with}}</ref>. 
~~== Architecture ==~~ Variational autoencoders are variational bayesian methods with a multivariate distribution as prior and a posterior, approximated by an artificial neural network, forming the so-called variational encoder-decoder structure<ref name=":2">An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. ''Special Lecture on IE'', ''2''(1).</ref><ref name="1bitVAE">{{cite arxiv\|eprint=1911.12410\|class=eess.SP\|author1=Khobahi, S.\|first2=M.\|last2=Soltanalian\|title=Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding\|date=2019}}</ref><ref>{{Cite journal\|last=Kingma\|first=Diederik P.\|last2=Welling\|first2=Max\|date=2019\|title=An Introduction to Variational Autoencoders\|url=http://arxiv.org/abs/1906.02691\|journal=Foundations and Trends® in Machine Learning\|volume=12\|issue=4\|pages=307–392\|doi=10.1561/2200000056\|issn=1935-8237}}</ref>. In addition to being seen as an [[autoencoder]] neural network architecture, variational autoencoders can also be studied within the mathematical formulation of [[variational Bayesian methods]], connecting a neural encoder network to its decoder through a probabilistic [[latent space]] (for example, as a [[multivariate Gaussian distribution]]) that corresponds to the parameters of a variational distribution. A vanilla encoder is an artificial neural network to reduce its input information into a bottleneck representation named latent space. It represents the first half of the architecture of both encoder and variational autoencoder. For the former, the output is a fixed vector of artificial neurons. For the latter, the outgoing information is compressed into a probabilistic latent space composed still by artificial neurons. 
However, in variational autoencoder architecture, they represent and are treated as two distinct vectors with the same dimensions, representing the vector of means and the vector of standard deviations, respectively. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together with the usage of the [[#Reparameterization\|reparameterization trick]], although the variance of the noise model can be learned separately.{{cn\|date=June 2024}} A vanilla decoder is still an artificial neural network thought to be the mirror architecture of the encoder. It takes as input the compressed information coming from the latent space, and then it expands it to produce an output that is as equal as possible to the encoder's input. While for an autoencoder, the decoder input is trivially a fixed-length vector of real values, for a variational autoencoder, it is necessary to introduce an intermediate step. Given the probabilistic nature of the latent space, it is possible to consider it as a multivariate Gaussian vector. With this assumption, and through the technique known as the reparametrization trick, it is possible to sample populations from this latent space and treat them precisely as a fixed-length vector of real values. Although this type of model was initially designed for [[unsupervised learning]],<ref>{{cite arXiv \|last1=Dilokthanakul \|first1=Nat \|last2=Mediano \|first2=Pedro A. M. \|last3=Garnelo \|first3=Marta \|last4=Lee \|first4=Matthew C. H. 
\|last5=Salimbeni \|first5=Hugh \|last6=Arulkumaran \|first6=Kai \|last7=Shanahan \|first7=Murray \|title=Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders \|date=2017-01-13 \|class=cs.LG \|eprint=1611.02648}}</ref><ref>{{cite book \|last1=Hsu \|first1=Wei-Ning \|last2=Zhang \|first2=Yu \|last3=Glass \|first3=James \|title=2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) \|chapter=Unsupervised ___domain adaptation for robust speech recognition via variational autoencoder-based data augmentation \|date=December 2017 \|pages=16–23 \|doi=10.1109/ASRU.2017.8268911 \|arxiv=1707.06265 \|isbn=978-1-5090-4788-8 \|s2cid=22681625 \|chapter-url=https://ieeexplore.ieee.org/document/8268911}}</ref> its effectiveness has been proven for [[semi-supervised learning]]<ref>{{cite book \|last1=Ehsan Abbasnejad \|first1=M. \|last2=Dick \|first2=Anthony \|last3=van den Hengel \|first3=Anton \|title=Infinite Variational Autoencoder for Semi-Supervised Learning \|date=2017 \|pages=5888–5897 \|url=https://openaccess.thecvf.com/content_cvpr_2017/html/Abbasnejad_Infinite_Variational_Autoencoder_CVPR_2017_paper.html}}</ref><ref>{{cite journal \|last1=Xu \|first1=Weidi \|last2=Sun \|first2=Haoze \|last3=Deng \|first3=Chao \|last4=Tan \|first4=Ying \|title=Variational Autoencoder for Semi-Supervised Text Classification \|journal=Proceedings of the AAAI Conference on Artificial Intelligence \|date=2017-02-12 \|volume=31 \|issue=1 \|doi=10.1609/aaai.v31i1.10966 \|s2cid=2060721 \|url=https://ojs.aaai.org/index.php/AAAI/article/view/10966 \|language=en\|doi-access=free }}</ref> and [[supervised learning]].<ref>{{cite journal \|last1=Kameoka \|first1=Hirokazu \|last2=Li \|first2=Li \|last3=Inoue \|first3=Shota \|last4=Makino \|first4=Shoji \|title=Supervised Determined Source Separation with Multichannel Variational Autoencoder \|journal=Neural Computation \|date=2019-09-01 \|volume=31 \|issue=9 \|pages=1891–1914 \|doi=10.1162/neco_a_01217 
\|pmid=31335290 \|s2cid=198168155 \|url=https://direct.mit.edu/neco/article/31/9/1891/8494/Supervised-Determined-Source-Separation-with\|url-access=subscription }}</ref> From a systemic point of view, both the vanilla autoencoder and the variational autoencoder models receive as input a set of high dimensional data. Then they adaptively compress it into a latent space (encoding), and finally, they try to reconstruct it as accurately as possible (decoding). Given the nature of its latent space, the variational autoencoder is characterized by a slightly different objective function: it has to minimize a reconstruction [[loss function]] like the vanilla autoencoder. However, it also takes into account the [[Kullback–Leibler divergence]] between the latent space and a vector of normal Gaussians. == Overview of architecture and operation == ~~== Formulation ==~~ A variational autoencoder is a generative model with a prior and noise distribution respectively. Usually such models are trained using the [[Expectation–maximization algorithm\|expectation-maximization]] meta-algorithm (e.g. [[Principal_component_analysis\|probabilistic PCA]], (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational [[Posterior_probability\|posteriors]]. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. 
[[File:VAE Basic.jpg\|thumb\|425x425px\|The basic scheme of a variational autoencoder. The model receives <math>\mathbf{x}</math> as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces <math>\mathbf{x'}</math> as similar as possible to <math>\mathbf{x}</math>.]] From a formal perspective, given an input dataset <math>\mathbf{x}</math> characterized by an unknown probability function <math>P(\mathbf{x})</math> and a multivariate latent encoding vector <math>\mathbf{z}</math>, we want to model the data as a distribution <math>p_\theta(\mathbf{x})</math>, with <math>\theta</math> defined as the set of the network parameters. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent. ~~It is possible to formalize this distribution as~~ To optimize this model, one needs to know two terms: the "reconstruction error", and the [[Kullback–Leibler divergence]] (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as p-distribution. For example, a standard VAE task such as IMAGENET is typically assumed to have a gaussianly distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. 
The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.<ref name="Kingma2013">{{cite arXiv \|last1=Kingma \|first1=Diederik P. \|last2=Welling \|first2=Max \|title=Auto-Encoding Variational Bayes \|date=2013-12-20 \|class=stat.ML \|eprint=1312.6114}}</ref>

More recent approaches replace [[Kullback–Leibler divergence]] (KL-D) with [[Statistical distance\|various statistical distances]], see [[#Statistical distance VAE variants\|"Statistical distance VAE variants"]] below.

== Formulation ==

From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data <math>x</math> under a chosen parameterized probability distribution <math>p_{\theta}(x) = p(x\|\theta)</math>. This distribution is usually chosen to be a Gaussian <math>N(x\|\mu,\sigma)</math>, parameterized by <math>\mu</math> and <math>\sigma</math>; as a member of the exponential family, it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize, but distributions where a prior is assumed over the latents <math>z</math> result in intractable integrals. Let us find the evidence <math>p_\theta(x)</math> via [[Marginal distribution\|marginalizing]] over the unobserved variable <math>z</math>.
: <math>p_\theta(x) = \int_{z}p_\theta({x,z}) \, dz, </math>

where <math>p_\theta({x,z})</math> represents the [[joint distribution]] under <math>p_\theta</math> of the observable data <math> x </math> and its latent representation or encoding <math> z </math>. According to the [[Chain rule (probability)\|chain rule]], the equation can be rewritten as

: <math>p_\theta(x) = \int_{z}p_\theta({x\| z})p_\theta(z) \, dz</math>

In the vanilla variational autoencoder, <math>z</math> is usually taken to be a finite-dimensional vector of real numbers, and <math>p_\theta({x\|z})</math> to be a [[Gaussian distribution]]. Then <math>p_\theta(x)</math> is a mixture of Gaussian distributions.

It is now possible to define the set of the relationships between the input data and its latent representation as
* Prior <math>p_\theta(z)</math>
* Likelihood <math>p_\theta(x\|z)</math>
* Posterior <math>p_\theta(z\|x)</math>

Unfortunately, the computation of <math>p_\theta(z\|x)</math> is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce a further function to approximate the posterior distribution as

:<math>q_\phi({z\| x}) \approx p_\theta({z\| x})</math>

with <math>\phi</math> defined as the set of real values that parametrize <math>q</math>. This is sometimes called ''amortized inference'', since by "investing" in finding a good <math>q_\phi</math>, one can later infer <math>z</math> from <math>x</math> quickly without doing any integrals.
In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution <math>p_\theta(x\|z)</math> is computed by the ''probabilistic decoder'', and the approximated posterior distribution <math>q_\phi(z\|x)</math> is computed by the ''probabilistic encoder''. Parametrize the encoder as <math>E_\phi</math>, and the decoder as <math>D_\theta</math>.

[[File:Reparameterization Trick.jpg\|thumb\|425x425px\|The scheme of the reparameterization trick. The randomness variable <math>\mathbf{\varepsilon}</math> is injected into the latent space <math>\mathbf{z}</math> as external input. In this way, it is possible to backpropagate the gradient without involving a stochastic variable during the update.]]
[[File:Reparameterized Variational Autoencoder.jpg\|thumb\|The scheme of a variational autoencoder after the reparameterization trick. The model receives <math>\mathbf{x}</math> as input. The probabilistic encoder compresses it into the latent space composed by the mean vector <math>\boldsymbol{\mu}</math> and the standard deviation vector <math>\boldsymbol{\sigma}</math>. 
The decoder receives as input the information sampled from the latent space <math>\mathbf{z}</math> and produces <math>\mathbf{x'}</math> as similar as possible to <math>\mathbf{x}</math>.\|425x425px]] ~~As in every [[deep learning]] problem, it is necessary to define a differentiable loss function in order to update the network weights through [[backpropagation]].~~ == Evidence lower bound (ELBO) == For variational autoencoders the idea is to jointly minimize the generative model parameters <math>\theta</math> to reduce the reconstruction error between the input and the output of the network, and <math>\Phi</math> to have <math>q_\Phi(\mathbf{z\|x})</math> as close as possible to <math>p_\theta(\mathbf{z}\|\mathbf{x})</math>. {{Main\|Evidence lower bound}} Like many [[deep learning]] approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through [[backpropagation]]. ~~As reconstruction loss [[mean squared error]] and [[cross entropy]] represent good alternatives.~~ For variational autoencoders, the idea is to jointly optimize the generative model parameters <math>\theta</math> to reduce the reconstruction error between the input and the output, and <math>\phi</math> to make <math>q_\phi({z\| x})</math> as close as possible to <math>p_\theta(z\|x)</math>. As reconstruction loss, [[mean squared error]] and [[cross entropy]] are often used. As distance loss between the two distributions the reverse Kullback–Leibler divergence <math>D_{KL}(q_\Phi(\mathbf{z\|x})\|\|p_\theta(\mathbf{z\|x}))</math> is a good choice to squeeze <math>q_\Phi(\mathbf{z\|x})</math> under <math>p_\theta(\mathbf{z}\|\mathbf{x})</math><ref name=":0" /><ref>{{cite web \|title=From Autoencoder to Beta-VAE \|url=https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html \|website=Lil'Log \|language=en \|date=2018-08-12}}</ref>. 
As distance loss between the two distributions the Kullback–Leibler divergence <math>D_{KL}(q_\phi({z\| x})\parallel p_\theta({z\| x}))</math> is a good choice to squeeze <math>q_\phi({z\| x})</math> under <math>p_\theta(z\|x)</math>.<ref name="Kingma2013"/><ref>{{cite news \|title=From Autoencoder to Beta-VAE \|url=https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html \|website=Lil'Log \|language=en \|date=2018-08-12}}</ref>

The distance loss just defined is expanded as

: <math>\begin{align}
D_{KL}(q_\phi({z\| x})\parallel p_\theta({z\| x})) &= \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{q_\phi(z\|x)}{p_\theta(z\|x)}\right]\\
&= \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{q_\phi({z\| x})p_\theta(x)}{p_\theta(x, z)}\right]\\
&= \ln p_\theta(x) + \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{q_\phi({z\| x})}{p_\theta(x, z)}\right]
\end{align}</math>

Now define the [[evidence lower bound]] (ELBO):<math display="block">L_{\theta,\phi}(x) := \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z\| x})}\right] = \ln p_\theta(x) - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta({\cdot \| x})) </math>Maximizing the ELBO<math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>is equivalent to simultaneously maximizing <math>\ln p_\theta(x) </math> and minimizing <math> D_{KL}(q_\phi({z\| x})\parallel p_\theta({z\| x})) </math>. That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior <math>q_\phi(\cdot \| x) </math> from the exact posterior <math>p_\theta(\cdot \| x) </math>.

The form given is not very convenient for maximization, but the following, equivalent form, is:<math display="block">L_{\theta,\phi}(x) = \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln p_\theta(x\|z)\right] - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta(\cdot)) </math>where <math>\ln p_\theta(x\|z)</math> is implemented as <math>-\frac{1}{2}\\| x - D_\theta(z)\\|^2_2</math>, since that is, up to an additive constant, what <math>x\|z \sim \mathcal N(D_\theta(z), I)</math> yields. That is, we model the distribution of <math>x</math> conditional on <math>z</math> to be a Gaussian distribution centered on <math>D_\theta(z)</math>.
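The equivalence of the two ELBO forms can be checked numerically on a model where every quantity has a closed form. The sketch below is only an illustration under assumed choices that do not come from the article: a scalar latent with prior <math>z \sim \mathcal N(0, 1)</math>, an identity decoder <math>D_\theta(z) = z</math> so that <math>x\|z \sim \mathcal N(z, 1)</math>, and a Gaussian <math>q = \mathcal N(m, s^2)</math>. Under these assumptions the evidence is <math>\mathcal N(0, 2)</math> and the exact posterior is <math>\mathcal N(x/2, 1/2)</math>. All function names are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def kl_normal(m_q, var_q, m_p, var_p):
    """KL( N(m_q, var_q) || N(m_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (m_q - m_p) ** 2) / var_p - 1.0)


def elbo(x, m, s):
    """ELBO as 'expected log-likelihood minus KL to the prior' for q = N(m, s^2)."""
    # E_q[ln p(x|z)] with p(x|z) = N(x; z, 1); closed form because q is Gaussian.
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    return expected_log_lik - kl_normal(m, s ** 2, 0.0, 1.0)


def log_evidence(x):
    """ln p(x) for the toy model: marginalizing z gives x ~ N(0, 2)."""
    return log_normal_pdf(x, 0.0, 2.0)


def kl_to_posterior(x, m, s):
    """KL from q = N(m, s^2) to the exact posterior N(x/2, 1/2) of the toy model."""
    return kl_normal(m, s ** 2, x / 2.0, 0.5)


x, m, s = 1.0, 0.3, 0.8
lhs = elbo(x, m, s)                                # reconstruction form
rhs = log_evidence(x) - kl_to_posterior(x, m, s)   # evidence-minus-divergence form
print(lhs, rhs)  # the two values agree up to floating-point error
```

When <math>q</math> is set to the exact posterior (<code>m = x / 2</code>, <code>s = math.sqrt(0.5)</code>), the divergence term vanishes and the ELBO equals <math>\ln p_\theta(x)</math>; for any other choice it is strictly smaller.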
The distributions <math>q_\phi(z \|x)</math> and <math>p_\theta(z)</math> are often also chosen to be Gaussians, as <math>z\|x \sim \mathcal N(E_\phi(x), \sigma_\phi(x)^2I)</math> and <math>z \sim \mathcal N(0, I)</math>, with which we obtain, by the formula for [[Kullback–Leibler divergence#Multivariate normal distributions\|KL divergence of Gaussians]]:<math display="block">L_{\theta,\phi}(x) = -\frac 12\mathbb E_{z \sim q_\phi(\cdot \| x)} \left[ \\|x - D_\theta(z)\\|_2^2\right] - \frac 12 \left( N\sigma_\phi(x)^2 + \\|E_\phi(x)\\|_2^2 - 2N\ln\sigma_\phi(x) \right) + Const </math>Here <math> N </math> is the dimension of <math> z </math>. For a more detailed derivation and more interpretations of ELBO and its maximization, see [[Evidence lower bound\|its main page]].

The two forms of the ELBO give the identity

<math>\ln p_\theta(x) - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta({\cdot \| x})) = \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln p_\theta(x\|z)\right] - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta(\cdot))</math>

The goal is to maximize the log-likelihood on the left-hand side, which improves the quality of the generated data, while minimizing the divergence between the true posterior and the estimated one.

== Reparameterization ==

[[File:Reparameterization Trick.png\|thumb\|300x300px\|The scheme of the reparameterization trick. The randomness variable <math>{\varepsilon}</math> is injected into the latent space <math>z</math> as external input. In this way, it is possible to backpropagate the gradient without involving a stochastic variable during the update.]]

To efficiently search for <math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>the typical method is [[gradient ascent]].
It is straightforward to find<math display="block">\nabla_\theta \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z\| x})}\right] = \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[ \nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi({z\| x})}\right] </math>However, <math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z\| x})}\right] </math>does not allow one to put the <math>\nabla_\phi </math> inside the expectation, since <math>\phi </math> appears in the probability distribution itself. The '''reparameterization trick''' (also known as stochastic backpropagation<ref>{{Cite journal \|last1=Rezende \|first1=Danilo Jimenez \|last2=Mohamed \|first2=Shakir \|last3=Wierstra \|first3=Daan \|date=2014-06-18 \|title=Stochastic Backpropagation and Approximate Inference in Deep Generative Models \|url=https://proceedings.mlr.press/v32/rezende14.html \|journal=International Conference on Machine Learning \|language=en \|publisher=PMLR \|pages=1278–1286\|arxiv=1401.4082 }}</ref>) bypasses this difficulty.<ref name="Kingma2013"/><ref>{{Cite journal\|last1=Bengio\|first1=Yoshua\|last2=Courville\|first2=Aaron\|last3=Vincent\|first3=Pascal\|title=Representation Learning: A Review and New Perspectives\|url=https://ieeexplore.ieee.org/document/6472238\|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence\|year=2013\|volume=35\|issue=8\|pages=1798–1828\|doi=10.1109/TPAMI.2013.50\|pmid=23787338\|issn=1939-3539\|arxiv=1206.5538\|s2cid=393948}}</ref><ref>{{Cite arXiv\|last1=Kingma\|first1=Diederik P.\|last2=Rezende\|first2=Danilo J.\|last3=Mohamed\|first3=Shakir\|last4=Welling\|first4=Max\|date=2014-10-31\|title=Semi-Supervised Learning with Deep Generative Models\|class=cs.LG\|eprint=1406.5298}}</ref>

The most important example is when <math>z \sim q_\phi(\cdot \| x) </math> is normally distributed, as <math>\mathcal N(\mu_\phi(x), \Sigma_\phi(x)) </math>.

[[File:Reparameterized Variational Autoencoder.png\|thumb\|The scheme of a variational autoencoder after the reparameterization trick \|300x300px]]

This can be reparametrized by letting <math>\epsilon \sim \mathcal{N}(0, I)</math> be a "standard [[Random number generation\|random number generator]]", and constructing <math>z </math> as <math>z = \mu_\phi(x) + L_\phi(x)\epsilon </math>. Here, <math>L_\phi(x) </math> is obtained by the [[Cholesky decomposition]]:<math display="block">\Sigma_\phi(x) = L_\phi(x)L_\phi(x)^T </math>Then we have<math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z\| x})}\right] = \mathbb {E}_{\epsilon}\left[ \nabla_\phi \ln {\frac {p_{\theta }(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_{\phi }(\mu_\phi(x) + L_\phi(x)\epsilon \| x)}}\right] </math>and so we obtain an unbiased estimator of the gradient, allowing [[stochastic gradient descent]].

Since we reparametrized <math>z</math>, we need to find <math>q_\phi(z\|x)</math>.
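The effect of moving the gradient inside the expectation can be illustrated in one dimension, where the Cholesky factor reduces to the standard deviation <math>\sigma</math>. The following is a minimal sketch, not the estimator used by any particular library: it assumes a stand-in integrand <math>f(z) = z^2</math> (instead of the ELBO integrand) so that the exact answer is known, namely <math>\mathbb E[z^2] = \mu^2 + \sigma^2</math> with gradient <math>(2\mu, 2\sigma)</math>. The function name and all parameter values are hypothetical.

```python
import random


def reparam_gradients(mu, sigma, num_samples, seed=0):
    """Pathwise (reparameterized) gradient estimate of E_{z ~ N(mu, sigma^2)}[z^2]."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of (mu, sigma)
        z = mu + sigma * eps        # reparameterized sample, differentiable in mu and sigma
        df_dz = 2.0 * z             # derivative of the integrand f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.7, num_samples=100_000)
# Exact gradient of mu^2 + sigma^2 is (2 * mu, 2 * sigma) = (3.0, 1.4);
# the Monte Carlo estimates fluctuate around these values.
print(g_mu, g_sigma)
```

Because the randomness enters only through <code>eps</code>, each term of the average is an ordinary derivative, which is what automatic differentiation frameworks compute during backpropagation.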
Let <math>q_0</math> be the probability density function for <math>\epsilon</math>, then {{clarify \|reason=The following calculations might have mistakes.\|date=October 2023}}<math display="block">\ln q_\phi(z \| x) = \ln q_0 (\epsilon) - \ln\|\det(\partial_\epsilon z)\|</math>where <math>\partial_\epsilon z</math> is the Jacobian matrix of <math>z</math> with respect to <math>\epsilon</math>. Since <math>z = \mu_\phi(x) + L_\phi(x)\epsilon </math>, this is <math display="block">\ln q_\phi(z \| x) = -\frac 12 \\|\epsilon\\|^2 - \ln\|\det L_\phi(x)\| - \frac N2 \ln(2\pi)</math>where <math>N</math> is the dimension of <math>z</math>.

== Variations ==

Many variational autoencoder applications and extensions have been used to adapt the architecture to other domains and improve its performance.

<math>\beta</math>-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for <math>\beta</math> values greater than one.
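As a rough illustration of the weighting, the sketch below computes a single-sample negative ELBO in which the Kullback–Leibler term is multiplied by <math>\beta</math>. It assumes the Gaussian choices of the formulation above (unit-variance Gaussian likelihood, prior <math>\mathcal N(0, I)</math>, and posterior <math>\mathcal N(\text{mean}, \sigma^2 I)</math> with a scalar <math>\sigma</math>). The function names and the default <code>beta=4.0</code> are arbitrary choices for this example and are not taken from the cited papers.

```python
import math


def gaussian_kl_to_standard_normal(mean, sigma):
    """KL( N(mean, sigma^2 I) || N(0, I) ) for a latent mean vector and scalar sigma."""
    n = len(mean)
    sq_norm = sum(m * m for m in mean)
    # Includes the constant -n that the ELBO formula absorbs into "Const".
    return 0.5 * (n * sigma ** 2 + sq_norm - n - 2 * n * math.log(sigma))


def beta_vae_loss(x, x_reconstructed, mean, sigma, beta=4.0):
    """Single-sample negative ELBO with the KL term weighted by beta (to be minimized)."""
    # Reconstruction term: 0.5 * ||x - D(z)||^2, from a unit-variance Gaussian likelihood.
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return recon + beta * gaussian_kl_to_standard_normal(mean, sigma)


# Hypothetical values standing in for an input, its reconstruction, and encoder outputs.
x = [1.0, 2.0]
x_rec = [0.5, 2.5]
mean, sigma = [0.3, -0.4], 0.5
print(beta_vae_loss(x, x_rec, mean, sigma, beta=1.0))  # ordinary VAE objective
print(beta_vae_loss(x, x_rec, mean, sigma, beta=4.0))  # stronger pull toward the prior
```

Setting <code>beta=1.0</code> recovers the ordinary negative ELBO up to an additive constant; larger values penalize posteriors that deviate from the prior more heavily.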
The <math>\beta</math>-VAE architecture can discover disentangled latent factors without supervision.<ref>{{Cite conference\|last1=Higgins\|first1=Irina\|last2=Matthey\|first2=Loic\|last3=Pal\|first3=Arka\|last4=Burgess\|first4=Christopher\|last5=Glorot\|first5=Xavier\|last6=Botvinick\|first6=Matthew\|last7=Mohamed\|first7=Shakir\|last8=Lerchner\|first8=Alexander\|date=2016-11-04\|title=beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework\|url=https://openreview.net/forum?id=Sy2fzU9gl\|language=en\|conference=NeurIPS}}</ref><ref>{{Cite arXiv\|last1=Burgess\|first1=Christopher P.\|last2=Higgins\|first2=Irina\|last3=Pal\|first3=Arka\|last4=Matthey\|first4=Loic\|last5=Watters\|first5=Nick\|last6=Desjardins\|first6=Guillaume\|last7=Lerchner\|first7=Alexander\|date=2018-04-10\|title=Understanding disentangling in β-VAE\|class=stat.ML\|eprint=1804.03599}}</ref> ~~<math>\theta^,\Phi^ = \underset{\theta,\Phi}{arg min} L_{\theta,\Phi} </math>~~ The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.<ref>{{Cite conference\|last1=Sohn\|first1=Kihyuk\|last2=Lee\|first2=Honglak\|last3=Yan\|first3=Xinchen\|date=2015-01-01\|title=Learning Structured Output Representation using Deep Conditional Generative Models\|url=https://proceedings.neurips.cc/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf\|language=en\|conference=NeurIPS}}</ref> ~~The main advantage of this formulation relies on the possibility to jointly optimize with respect to parameters <math>\theta </math> and <math>\Phi </math>.~~ Some structures directly deal with the quality of the generated samples<ref>{{Cite arXiv\|last1=Dai\|first1=Bin\|last2=Wipf\|first2=David\|date=2019-10-30\|title=Diagnosing and Enhancing VAE Models\|class=cs.LG\|eprint=1903.05789}}</ref><ref>{{Cite arXiv\|last1=Dorta\|first1=Garoe\|last2=Vicente\|first2=Sara\|last3=Agapito\|first3=Lourdes\|last4=Campbell\|first4=Neill D. 
F.\|last5=Simpson\|first5=Ivor\|date=2018-07-31\|title=Training VAEs Under Structured Residuals\|class=stat.ML\|eprint=1804.01050}}</ref> or implement more than one latent space to further improve the representation learning. Before applying the ELBO loss function to an optimization problem to backpropagate the gradient, it is necessary to make it differentiable by applying the so-called '''reparameterization trick''', which removes the stochastic sampling from the formulation. Some architectures combine VAEs and [[generative adversarial network]]s to obtain hybrid models.<ref>{{Cite journal\|last1=Larsen\|first1=Anders Boesen Lindbo\|last2=Sønderby\|first2=Søren Kaae\|last3=Larochelle\|first3=Hugo\|last4=Winther\|first4=Ole\|date=2016-06-11\|title=Autoencoding beyond pixels using a learned similarity metric\|url=http://proceedings.mlr.press/v48/larsen16.html\|journal=International Conference on Machine Learning\|language=en\|publisher=PMLR\|pages=1558–1566\|arxiv=1512.09300}}</ref><ref>{{cite arXiv\|last1=Bao\|first1=Jianmin\|last2=Chen\|first2=Dong\|last3=Wen\|first3=Fang\|last4=Li\|first4=Houqiang\|last5=Hua\|first5=Gang\|date=2017\|title=CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training\|pages=2745–2754\|class=cs.CV\|eprint=1703.10155}}</ref><ref>{{Cite journal\|last1=Gao\|first1=Rui\|last2=Hou\|first2=Xingsong\|last3=Qin\|first3=Jie\|last4=Chen\|first4=Jiaxin\|last5=Liu\|first5=Li\|last6=Zhu\|first6=Fan\|last7=Zhang\|first7=Zhao\|last8=Shao\|first8=Ling\|date=2020\|title=Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning\|url=https://ieeexplore.ieee.org/document/8957359\|journal=IEEE Transactions on Image Processing\|volume=29\|pages=3665–3680\|doi=10.1109/TIP.2020.2964429\|pmid=31940538\|bibcode=2020ITIP...29.3665G\|s2cid=210334032\|issn=1941-0042\|url-access=subscription}}</ref> ~~== Reparameterization trick ==~~ It is not necessary to use gradients to update the encoder. 
In fact, the encoder is not necessary for the generative model.<ref>{{cite book \| last1=Drefs \| first1=J. \| last2=Guiraud \| first2=E. \| last3=Panagiotou \| first3=F. \| last4=Lücke \| first4=J. \| chapter=Direct evolutionary optimization of variational autoencoders with binary latents \| title=Joint European Conference on Machine Learning and Knowledge Discovery in Databases \| series=Lecture Notes in Computer Science \| pages=357–372 \| year=2023 \| volume=13715 \| publisher=Springer Nature Switzerland \| doi=10.1007/978-3-031-26409-2_22 \| isbn=978-3-031-26408-5 \| chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-26409-2_22 }}</ref> To make the ELBO formulation suitable for training purposes, it is necessary to introduce a further minor modification to the formulation of the problem as well as to the structure of the variational autoencoder.<ref name=":0" /><ref>{{Cite journal\|last=Bengio\|first=Yoshua\|last2=Courville\|first2=Aaron\|last3=Vincent\|first3=Pascal\|title=Representation Learning: A Review and New Perspectives\|url=https://ieeexplore.ieee.org/abstract/document/6472238?casa_token=wQPK9gUGfCsAAAAA:FS5uNYCQVJGH-bq-kVvZeTdnQ8a33C6qQ4VUyDyGLMO13QewH3wcry9_Jh-5FATvspBj8YOXfw\|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence\|volume=35\|issue=8\|pages=1798–1828\|doi=10.1109/TPAMI.2013.50\|issn=1939-3539}}</ref><ref>{{Cite journal\|last=Kingma\|first=Diederik P.\|last2=Rezende\|first2=Danilo J.\|last3=Mohamed\|first3=Shakir\|last4=Welling\|first4=Max\|date=2014-10-31\|title=Semi-Supervised Learning with Deep Generative Models\|url=http://arxiv.org/abs/1406.5298\|journal=arXiv:1406.5298 [cs, stat]}}</ref> == Statistical distance VAE variants == Stochastic sampling, the operation used to sample from the latent space and feed the probabilistic decoder, is not differentiable. 
To make backpropagation-based training, such as [[stochastic gradient descent]], feasible, the reparameterization trick is introduced. After the initial work of Diederik P. Kingma and [[Max Welling]],<ref>{{Cite arXiv \|eprint=1312.6114 \|class=stat.ML \|first1=Diederik P. \|last1=Kingma \|first2=Max \|last2=Welling \|title=Auto-Encoding Variational Bayes \|date=2022-12-10}}</ref> several procedures were proposed to formulate the operation of the VAE in a more abstract way. ~~The main assumption about the latent space is that it can be considered as a set of multivariate Gaussian distributions, and thus can be described as~~ In these approaches the loss function is composed of two parts: * the usual reconstruction error part, which seeks to ensure that the encoder-then-decoder mapping <math>x \mapsto D_\theta(E_\phi(x))</math> is as close to the identity map as possible; the sampling is done at run time from the empirical distribution <math>\mathbb{P}^{real}</math> of objects available (e.g., for MNIST or ImageNet this will be the empirical probability law of all images in the dataset). This gives the term: <math> \mathbb{E}_{x \sim \mathbb{P}^{real}} \left[ \\|x - D_\theta(E_\phi(x))\\|_2^2\right]</math>. * a variational part that ensures that, when the empirical distribution <math>\mathbb{P}^{real}</math> is passed through the encoder <math>E_\phi</math>, we recover the target distribution, denoted here <math>\mu(dz)</math>, which is usually taken to be a [[multivariate normal distribution]]. We denote this [[pushforward measure]] by <math>E_\phi \sharp \mathbb{P}^{real}</math>; in practice it is just the empirical distribution obtained by passing all dataset objects through the encoder <math> E_\phi</math>. 
To make sure that <math>E_\phi \sharp \mathbb{P}^{real}</math> is close to the target <math>\mu(dz)</math>, a [[statistical distance]] <math>d</math> is invoked and the term <math>d \left( \mu(dz), E_\phi \sharp \mathbb{P}^{real} \right)^2 </math> is added to the loss. We obtain the final formula for the loss: ~~<math>\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)</math>~~ <math display="block"> L_{\theta,\phi} = \mathbb{E}_{x \sim \mathbb{P}^{real}} \left[ \\|x - D_\theta(E_\phi(x))\\|_2^2\right] +d \left( \mu(dz), E_\phi \sharp \mathbb{P}^{real} \right)^2</math> The statistical distance <math>d</math> requires special properties; for instance, it must possess a formula as an expectation, because the loss function will need to be optimized by [[Stochastic gradient descent\|stochastic optimization algorithms]]. Several distances can be chosen, and this has given rise to several flavors of VAEs: ~~Given <math>\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})</math> and <math>\odot</math> defined as the element-wise product, the reparameterization trick modifies the above equation as~~ * the sliced Wasserstein distance used by Kolouri et al. in their VAE<ref>{{Cite conference \|last1=Kolouri \|first1=Soheil \|last2=Pope \|first2=Phillip E. \|last3=Martin \|first3=Charles E. \|last4=Rohde \|first4=Gustavo K. 
\|date=2019 \|title=Sliced Wasserstein Auto-Encoders \|url=https://openreview.net/forum?id=H1xaJn05FQ \|conference=International Conference on Learning Representations \|publisher=ICPR \|book-title=International Conference on Learning Representations}}</ref> * the [[energy distance]] implemented in the Radon Sobolev Variational Auto-Encoder<ref>{{Cite journal \|last=Turinici \|first=Gabriel \|year=2021 \|title=Radon-Sobolev Variational Auto-Encoders \|url=https://www.sciencedirect.com/science/article/pii/S0893608021001556 \|journal=Neural Networks \|volume=141 \|pages=294–305 \|arxiv=1911.13135 \|doi=10.1016/j.neunet.2021.04.018 \|issn=0893-6080 \|pmid=33933889}}</ref> ~~<math>\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon} </math>~~ * the [[Maximum Mean Discrepancy]] distance used in the MMD-VAE<ref>{{Cite journal \|arxiv=1705.02239 \|first1=A. \|last1=Gretton \|first2=Y. \|last2=Li \|title=A Polya Contagion Model for Networks \|date=2017 \|last3=Swersky \|first3=K. \|last4=Zemel \|first4=R. \|last5=Turner \|first5=R.\|journal=IEEE Transactions on Control of Network Systems \|volume=5 \|issue=4 \|pages=1998–2010 \|doi=10.1109/TCNS.2017.2781467 }}</ref> * the [[Wasserstein distance]] used in the WAEs<ref>{{Cite arXiv \|eprint=1711.01558 \|first1=I. \|last1=Tolstikhin \|first2=O. \|last2=Bousquet \|title=Wasserstein Auto-Encoders \|date=2018 \|last3=Gelly \|first3=S. \|last4=Schölkopf \|first4=B.\|class=stat.ML }}</ref> Thanks to this transformation, that can be extended also to other distributions different from the Gaussian, the variational autoencoder is trainable and the probabilistic encoder has to learn how to map a compressed representation of the input into the two latent vectors <math>\boldsymbol{\mu} </math> and <math>\boldsymbol{\sigma} </math>, while the stochasticity remains out from the updating process and is injected in the latent space as an external input through the random vector <math>\boldsymbol{\epsilon} </math>. 
* kernel-based distances used in the Kernelized Variational Autoencoder (K-VAE)<ref>{{Cite arXiv \|eprint=1901.02401 \|first1=C. \|last1=Louizos \|first2=X. \|last2=Shi \|title=Kernelized Variational Autoencoders \|date=2019 \|last3=Swersky \|first3=K. \|last4=Li \|first4=Y. \|last5=Welling \|first5=M.\|class=astro-ph.CO }}</ref> ~~== Applications ==~~ ~~There are many variational autoencoders applications and extensions in order to adapt the architecture to different domains and improve its performance.~~ <math>\beta</math>-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for <math>\beta</math> values greater than one. The authors demonstrate this architecture's ability to generate high-quality synthetic samples.<ref>{{Cite journal\|last=Higgins\|first=Irina\|last2=Matthey\|first2=Loic\|last3=Pal\|first3=Arka\|last4=Burgess\|first4=Christopher\|last5=Glorot\|first5=Xavier\|last6=Botvinick\|first6=Matthew\|last7=Mohamed\|first7=Shakir\|last8=Lerchner\|first8=Alexander\|date=2016-11-04\|title=beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework\|url=https://openreview.net/forum?id=Sy2fzU9gl\|language=en}}</ref><ref>{{Cite journal\|last=Burgess\|first=Christopher P.\|last2=Higgins\|first2=Irina\|last3=Pal\|first3=Arka\|last4=Matthey\|first4=Loic\|last5=Watters\|first5=Nick\|last6=Desjardins\|first6=Guillaume\|last7=Lerchner\|first7=Alexander\|date=2018-04-10\|title=Understanding disentangling in $\beta$-VAE\|url=http://arxiv.org/abs/1804.03599\|journal=arXiv:1804.03599 [cs, stat]}}</ref> 
Another implementation, the conditional variational autoencoder (CVAE), inserts label information in the latent space to force a deterministic constrained representation of the learned data.<ref>{{Cite journal\|last=Sohn\|first=Kihyuk\|last2=Lee\|first2=Honglak\|last3=Yan\|first3=Xinchen\|date=2015-01-01\|title=Learning Structured Output Representation using Deep Conditional Generative Models\|url=https://openreview.net/forum?id=rJWXGDWd-H\|language=en}}</ref> Some structures directly deal with the quality of the generated samples<ref>{{Cite journal\|last=Dai\|first=Bin\|last2=Wipf\|first2=David\|date=2019-10-30\|title=Diagnosing and Enhancing VAE Models\|url=http://arxiv.org/abs/1903.05789\|journal=arXiv:1903.05789 [cs, stat]}}</ref><ref>{{Cite journal\|last=Dorta\|first=Garoe\|last2=Vicente\|first2=Sara\|last3=Agapito\|first3=Lourdes\|last4=Campbell\|first4=Neill D. F.\|last5=Simpson\|first5=Ivor\|date=2018-07-31\|title=Training VAEs Under Structured Residuals\|url=http://arxiv.org/abs/1804.01050\|journal=arXiv:1804.01050 [cs, stat]}}</ref> or implement more than one latent space to further improve the representation learning.<ref>{{Cite journal\|last=Tomczak\|first=Jakub\|last2=Welling\|first2=Max\|date=2018-03-31\|title=VAE with a VampPrior\|url=http://proceedings.mlr.press/v84/tomczak18a.html\|journal=International Conference on Artificial Intelligence and Statistics\|language=en\|publisher=PMLR\|pages=1214–1223}}</ref><ref>{{Cite journal\|last=Razavi\|first=Ali\|last2=Oord\|first2=Aaron van den\|last3=Vinyals\|first3=Oriol\|date=2019-06-02\|title=Generating Diverse High-Fidelity Images with VQ-VAE-2\|url=http://arxiv.org/abs/1906.00446\|journal=arXiv:1906.00446 [cs, stat]}}</ref> 
Some architectures mix the structures of variational autoencoders and [[Generative adversarial network\|generative adversarial networks]] to obtain hybrid models with high generative capabilities<ref>{{Cite journal\|last=Larsen\|first=Anders Boesen Lindbo\|last2=Sønderby\|first2=Søren Kaae\|last3=Larochelle\|first3=Hugo\|last4=Winther\|first4=Ole\|date=2016-06-11\|title=Autoencoding beyond pixels using a learned similarity metric\|url=http://proceedings.mlr.press/v48/larsen16.html\|journal=International Conference on Machine Learning\|language=en\|publisher=PMLR\|pages=1558–1566}}</ref><ref>{{Cite journal\|last=Bao\|first=Jianmin\|last2=Chen\|first2=Dong\|last3=Wen\|first3=Fang\|last4=Li\|first4=Houqiang\|last5=Hua\|first5=Gang\|date=2017\|title=CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training\|url=https://openaccess.thecvf.com/content_iccv_2017/html/Bao_CVAE-GAN_Fine-Grained_Image_ICCV_2017_paper.html\|pages=2745–2754}}</ref><ref>{{Cite journal\|last=Gao\|first=Rui\|last2=Hou\|first2=Xingsong\|last3=Qin\|first3=Jie\|last4=Chen\|first4=Jiaxin\|last5=Liu\|first5=Li\|last6=Zhu\|first6=Fan\|last7=Zhang\|first7=Zhao\|last8=Shao\|first8=Ling\|date=2020\|title=Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning\|url=https://ieeexplore.ieee.org/abstract/document/8957359?casa_token=d6k1X5ClbTsAAAAA:AiOSfZQ7S3EsfIaecikiuLX8Y9-Lf5FHqTFRjL-FMQQ8bNjdW2rD0UZxA0BC4gVMO0QjF_YXkw\|journal=IEEE Transactions on Image Processing\|volume=29\|pages=3665–3680\|doi=10.1109/TIP.2020.2964429\|issn=1941-0042}}</ref>. == See also == Line 122 ⟶ 124: * [[Artificial neural network]] * [[Deep learning]] * [[Generative ~~Adversarial~~adversarial ~~Network~~network]] * [[Representation learning]] * [[Sparse dictionary learning]] Line 130 ⟶ 132: == References == {{reflist}} ~~<references />~~ == Further reading == * {{cite journal \|last1=Kingma \|first1=Diederik P. 
\|last2=Welling \|first2=Max \|year=2019 \|title=An Introduction to Variational Autoencoders \|journal=Foundations and Trends in Machine Learning \|publisher=Now Publishers \|volume=12 \|issue=4 \|pages=307–392 \|doi=10.1561/2200000056 \|arxiv=1906.02691 \|issn=1935-8237}} {{Artificial intelligence navbox}} [[Category:Neural network architectures]] [[Category:Unsupervised learning]] [[Category:Supervised learning]] [[Category:Graphical models]] [[Category:Bayesian statistics]] [[Category:Dimension reduction]] [[Category:2013 in artificial intelligence]]