Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the [[#Reparameterization|reparameterization trick]], although the variance of the noise model can be learned separately.{{cn|date=June 2024}}
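A minimal sketch of such an encoder–decoder pair, written in PyTorch purely for illustration; the fully connected layers, layer sizes, and names (<code>Encoder</code>, <code>Decoder</code>) are assumptions of this sketch, not a reference implementation:
<syntaxhighlight lang="python">
import torch
from torch import nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mean, log-variance) of a Gaussian in latent space."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mean = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mean(h), self.log_var(h)

class Decoder(nn.Module):
    """Maps a latent code z back to the input space (here, the mean of the output distribution)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.out(h)
</syntaxhighlight>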
Although this type of model was initially designed for [[unsupervised learning]],<ref>{{cite arXiv |last1=Dilokthanakul |first1=Nat |last2=Mediano |first2=Pedro A. M. |last3=Garnelo |first3=Marta |last4=Lee |first4=Matthew C. H. |last5=Salimbeni |first5=Hugh |last6=Arulkumaran |first6=Kai |last7=Shanahan |first7=Murray |title=Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders |date=2017-01-13 |class=cs.LG |eprint=1611.02648}}</ref><ref>{{cite book |last1=Hsu |first1=Wei-Ning |last2=Zhang |first2=Yu |last3=Glass |first3=James |title=2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |chapter=Unsupervised ___domain adaptation for robust speech recognition via variational autoencoder-based data augmentation |date=December 2017 |pages=16–23 |doi=10.1109/ASRU.2017.8268911 |arxiv=1707.06265 |isbn=978-1-5090-4788-8 |s2cid=22681625}}</ref> its effectiveness has also been demonstrated for [[semi-supervised learning]] and [[supervised learning]].
== Overview of architecture and operation ==
To optimize this model, one needs to know two terms: the "reconstruction error", and the [[Kullback–Leibler divergence]] (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as the p-distribution. For example, a standard VAE task such as IMAGENET is typically assumed to have Gaussian-distributed noise, whereas tasks such as binarized MNIST require Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.<ref name="Kingma2013">{{cite arXiv |last1=Kingma |first1=Diederik P. |last2=Welling |first2=Max |title=Auto-Encoding Variational Bayes |date=2013-12-20 |class=stat.ML |eprint=1312.6114}}</ref>
More recent approaches replace [[Kullback–Leibler divergence]] (KL-D) with [[Statistical distance|various statistical distances]]; see [[#Statistical distance VAE variants|statistical distance VAE variants]] below.
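As an illustrative sketch of how the assumed noise model changes the reconstruction term (the helper name <code>reconstruction_loss</code> and the unnormalized decoder output <code>x_hat</code> are assumptions of this sketch): a Gaussian noise model reduces, up to constants, to a squared error, while a Bernoulli noise model (e.g., binarized MNIST) gives a binary cross-entropy.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def reconstruction_loss(x_hat, x, noise_model="gaussian"):
    if noise_model == "gaussian":
        # Gaussian noise: -ln p(x|z) is, up to an additive constant, a squared error.
        return 0.5 * F.mse_loss(x_hat, x, reduction="sum")
    elif noise_model == "bernoulli":
        # Bernoulli noise (binarized data): -ln p(x|z) is a binary cross-entropy.
        return F.binary_cross_entropy(torch.sigmoid(x_hat), x, reduction="sum")
    raise ValueError("unknown noise model")
</syntaxhighlight>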
== Formulation ==
The evidence lower bound (ELBO) is defined as<math display="block">L_{\theta,\phi}(x) := \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right]
= \ln p_\theta(x) - D_{KL}(q_\phi({\cdot| x})\parallel p_\theta({\cdot | x})) </math>Maximizing the ELBO<math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>is equivalent to simultaneously maximizing <math>\ln p_\theta(x) </math> and minimizing <math> D_{KL}(q_\phi({z| x})\parallel p_\theta({z| x})) </math>. That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior <math>q_\phi(\cdot | x) </math> from the exact posterior <math>p_\theta(\cdot | x) </math>.
The form given is not very convenient for maximization, but the following equivalent form is:<math display="block">L_{\theta,\phi}(x) = \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln p_\theta(x|z)\right] - D_{KL}(q_\phi({\cdot| x})\parallel p_\theta(\cdot)) </math>where <math>\ln p_\theta(x|z)</math> is implemented as <math>-\frac{1}{2}\| x - D_\theta(z)\|^2_2</math>, since that is, up to an additive constant, what <math>x|z \sim \mathcal N(D_\theta(z), I)</math> yields. That is, we model the distribution of <math>x</math> conditional on <math>z</math> to be a Gaussian distribution centered on <math>D_\theta(z)</math>. The distributions of <math>q_\phi(z |x)</math> and <math>p_\theta(z)</math> are often also chosen to be Gaussians as <math>z|x \sim \mathcal N(E_\phi(x), \sigma_\phi(x)^2I)</math> and <math>z \sim \mathcal N(0, I)</math>, with which we obtain by the formula for [[Kullback–Leibler divergence#Multivariate normal distributions|KL divergence of Gaussians]]:<math display="block">L_{\theta,\phi}(x) = -\frac 12\mathbb E_{z \sim q_\phi(\cdot | x)} \left[ \|x - D_\theta(z)\|_2^2\right] - \frac 12 \left( N\sigma_\phi(x)^2 + \|E_\phi(x)\|_2^2 - 2N\ln\sigma_\phi(x) \right) + Const </math>Here <math> N </math> is the dimension of <math> z </math>. For a more detailed derivation and more interpretations of ELBO and its maximization, see [[Evidence lower bound|its main page]].
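A sketch of the resulting training objective (the negative of the ELBO above), assuming the Gaussian choices just described with a diagonal-covariance encoder that outputs <code>mu</code> and <code>log_var</code>; the function and variable names are assumptions of this sketch, not part of any reference implementation:
<syntaxhighlight lang="python">
import torch

def negative_elbo(x, x_hat, mu, log_var):
    # Reconstruction term: -E_q[ln p(x|z)] up to an additive constant,
    # for a Gaussian decoder x|z ~ N(D(z), I), estimated with a single sample z.
    recon = 0.5 * torch.sum((x - x_hat) ** 2)
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and the prior N(0, I).
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Minimizing recon + kl maximizes the ELBO up to the additive constant.
    return recon + kl
</syntaxhighlight>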
== Reparameterization ==
It is straightforward to find<math display="block">\nabla_\theta \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right]
= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[ \nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right] </math>However, <math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right] </math>does not allow one to put the <math>\nabla_\phi </math> inside the expectation, since <math>\phi </math> appears in the probability distribution itself. The '''reparameterization trick''' (also known as stochastic backpropagation<ref>{{Cite journal |last1=Rezende |first1=Danilo Jimenez |last2=Mohamed |first2=Shakir |last3=Wierstra |first3=Daan |date=2014-06-18 |title=Stochastic Backpropagation and Approximate Inference in Deep Generative Models |url=https://proceedings.mlr.press/v32/rezende14.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=1278–1286|arxiv=1401.4082 }}</ref>) bypasses this difficulty.<ref name="Kingma2013"/><ref>{{Cite journal|last1=Bengio|first1=Yoshua|last2=Courville|first2=Aaron|last3=Vincent|first3=Pascal|title=Representation Learning: A Review and New Perspectives|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2013|volume=35|issue=8|pages=1798–1828|arxiv=1206.5538}}</ref>
The most important example is when <math>z \sim q_\phi(\cdot | x) </math> is normally distributed, as <math>\mathcal N(\mu_\phi(x), \Sigma_\phi(x)) </math>.
By the reparameterization trick, we write <math>z = \mu_\phi(x) + L_\phi(x)\epsilon</math> with <math>\epsilon \sim \mathcal N(0, I)</math>, where <math>L_\phi(x)</math> is obtained from the [[Cholesky decomposition]] <math>\Sigma_\phi(x) = L_\phi(x)L_\phi(x)^T</math>. Then<math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right]
= \nabla_\phi \mathbb {E}_{\epsilon}\left[ \ln {\frac {p_{\theta }(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_{\phi }(\mu_\phi(x) + L_\phi(x)\epsilon | x)}}\right]
= \mathbb {E}_{\epsilon}\left[ \nabla_\phi \ln {\frac {p_{\theta }(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_{\phi }(\mu_\phi(x) + L_\phi(x)\epsilon | x)}}\right] </math>and so we obtain an unbiased estimator of the gradient, allowing [[stochastic gradient descent]].
Since we reparameterized <math>z</math>, we need to find <math>q_\phi(z|x)</math>. Let <math>q_0</math> be the probability density function for <math>\epsilon</math>, then <math display="block">\ln q_\phi(z | x) = \ln q_0 (\epsilon) - \ln|\det(\partial_\epsilon z)|</math>where <math>\partial_\epsilon z</math> is the Jacobian matrix of <math>z</math> with respect to <math>\epsilon</math>. Since <math>z = \mu_\phi(x) + L_\phi(x)\epsilon</math>, this gives<math display="block">\ln q_\phi(z | x) = -\frac{1}{2}\|\epsilon\|^2 - \ln|\det L_\phi(x)| - \frac{N}{2}\ln(2\pi)</math>
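A sketch of the trick as it is typically coded, assuming the common diagonal-covariance case in which the Cholesky factor <math>L_\phi(x)</math> reduces to the element-wise standard deviation; the names and the <code>log_var</code> parameterization are assumptions of this sketch:
<syntaxhighlight lang="python">
import torch

def reparameterize(mu, log_var):
    # epsilon is drawn from a fixed N(0, I); its distribution does not depend on phi.
    eps = torch.randn_like(mu)
    std = torch.exp(0.5 * log_var)  # diagonal case: L_phi(x) = diag(sigma_phi(x))
    # z is a deterministic, differentiable function of (mu, log_var) and the noise,
    # so gradients with respect to the encoder parameters phi can be backpropagated.
    return mu + std * eps
</syntaxhighlight>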
== Variations ==
Some structures directly deal with the quality of the generated samples<ref>{{Cite arXiv|last1=Dai|first1=Bin|last2=Wipf|first2=David|date=2019-10-30|title=Diagnosing and Enhancing VAE Models|class=cs.LG|eprint=1903.05789}}</ref><ref>{{Cite arXiv|last1=Dorta|first1=Garoe|last2=Vicente|first2=Sara|last3=Agapito|first3=Lourdes|last4=Campbell|first4=Neill D. F.|last5=Simpson|first5=Ivor|date=2018-07-31|title=Training VAEs Under Structured Residuals|class=stat.ML|eprint=1804.01050}}</ref> or implement more than one latent space to further improve the representation learning.
Some architectures mix VAE and [[generative adversarial network]]s to obtain hybrid models.<ref>{{Cite journal|last1=Larsen|first1=Anders Boesen Lindbo|last2=Sønderby|first2=Søren Kaae|last3=Larochelle|first3=Hugo|last4=Winther|first4=Ole|date=2016-06-11|title=Autoencoding beyond pixels using a learned similarity metric|url=http://proceedings.mlr.press/v48/larsen16.html|journal=International Conference on Machine Learning|language=en|publisher=PMLR|pages=1558–1566|arxiv=1512.09300}}</ref><ref>{{cite arXiv|last1=Bao|first1=Jianmin|last2=Chen|first2=Dong|last3=Wen|first3=Fang|last4=Li|first4=Houqiang|last5=Hua|first5=Gang|date=2017|title=CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training|pages=2745–2754|class=cs.CV|eprint=1703.10155}}</ref><ref>{{Cite journal|last1=Gao|first1=Rui|last2=Hou|first2=Xingsong|last3=Qin|first3=Jie|last4=Chen|first4=Jiaxin|last5=Liu|first5=Li|last6=Zhu|first6=Fan|last7=Zhang|first7=Zhao|last8=Shao|first8=Ling|date=2020|title=Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning|journal=IEEE Transactions on Image Processing}}</ref>
It is not necessary to use gradients to update the encoder. In fact, the encoder is not necessary for the generative model.<ref>{{cite book | last1=Drefs | first1=J. | last2=Guiraud | first2=E. | last3=Panagiotou | first3=F. | last4=Lücke | first4=J. | chapter=Direct evolutionary optimization of variational autoencoders with binary latents | title=Joint European Conference on Machine Learning and Knowledge Discovery in Databases | series=Lecture Notes in Computer Science | pages=357–372 | year=2023 | volume=13715 | publisher=Springer Nature Switzerland | doi=10.1007/978-3-031-26409-2_22 | arxiv=2011.13704 | isbn=978-3-031-26408-5 }}</ref>
== Statistical distance VAE variants==
After the initial work of Diederik P. Kingma and [[Max Welling]], several procedures were proposed to formulate the operation of the VAE in a more abstract way. In these approaches the loss function is composed of two parts:
* the usual reconstruction error part, which seeks to ensure that the encoder-then-decoder mapping <math>x \mapsto D_\theta(E_\phi(x))</math> is as close to the identity map as possible; the sampling is done at run time from the empirical distribution <math>\mathbb{P}^{real}</math> of objects available (e.g., for MNIST or IMAGENET this will be the empirical probability law of all images in the dataset). This gives the term: <math> \mathbb{E}_{x \sim \mathbb{P}^{real}} \left[ \|x - D_\theta(E_\phi(x))\|_2^2\right]</math>.
* a variational part that ensures that, when the empirical distribution <math>\mathbb{P}^{real}</math> is passed through the encoder <math>E_\phi</math>, we recover the target distribution, denoted here <math>\mu(dz)</math>, that is usually taken to be a [[multivariate normal distribution]]. We will denote <math>E_\phi \sharp \mathbb{P}^{real}</math> this [[pushforward measure]], which in practice is the distribution obtained by passing the available objects through the encoder <math>E_\phi</math>. This gives the term: <math> d \left( \mu(dz), E_\phi \sharp \mathbb{P}^{real} \right)^2</math>, where <math>d</math> is a [[statistical distance]].
We obtain the final formula for the loss:
<math display="block">L_{\theta,\phi} = \mathbb{E}_{x \sim \mathbb{P}^{real}} \left[ \|x - D_\theta(E_\phi(x))\|_2^2\right]
+d \left( \mu(dz), E_\phi \sharp \mathbb{P}^{real} \right)^2</math>
The statistical distance <math>d</math> requires special properties; for instance, it must admit a formulation as an expectation, because the loss function needs to be optimized by [[stochastic optimization]] algorithms. Several distances can be chosen, giving rise to several flavors of VAEs (a code sketch of the two-part loss is given after the list):
* the sliced Wasserstein distance used by Kolouri et al. in their VAE<ref>{{Cite conference |last1=Kolouri |first1=Soheil |last2=Pope |first2=Phillip E. |last3=Martin |first3=Charles E. |last4=Rohde |first4=Gustavo K. |date=2019 |title=Sliced Wasserstein Auto-Encoders |url=https://openreview.net/forum?id=H1xaJn05FQ |conference=International Conference on Learning Representations |publisher=ICPR |book-title=International Conference on Learning Representations}}</ref>
* the [[energy distance]] implemented in the Radon Sobolev Variational Auto-Encoder
* the [[Maximum Mean Discrepancy]] distance used in the MMD-VAE<ref>{{Cite journal |arxiv=1705.02239 |first1=A. |last1=Gretton |first2=Y. |last2=Li |title=A Polya Contagion Model for Networks |date=2017 |last3=Swersky |first3=K. |last4=Zemel |first4=R. |last5=Turner |first5=R.|journal=IEEE Transactions on Control of Network Systems |volume=5 |issue=4 |pages=1998–2010 |doi=10.1109/TCNS.2017.2781467 }}</ref>
* the [[Wasserstein distance]] used in the WAEs<ref>{{Cite arXiv |eprint=1711.01558 |first1=I. |last1=Tolstikhin |first2=O. |last2=Bousquet |title=Wasserstein Auto-Encoders |date=2018 |last3=Gelly |first3=S. |last4=Schölkopf |first4=B.|class=stat.ML }}</ref>
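A sketch of the two-part loss above with one concrete choice of <math>d</math>, a Gaussian-kernel estimate of the squared Maximum Mean Discrepancy as in MMD-style VAEs; the kernel bandwidth, the batch-based estimator, and the helper names are assumptions of this sketch rather than any particular published implementation:
<syntaxhighlight lang="python">
import torch

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2)) for all pairs of rows.
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(z_enc, z_prior, bandwidth=1.0):
    # Biased batch estimator of the squared Maximum Mean Discrepancy.
    return (gaussian_kernel(z_enc, z_enc, bandwidth).mean()
            + gaussian_kernel(z_prior, z_prior, bandwidth).mean()
            - 2.0 * gaussian_kernel(z_enc, z_prior, bandwidth).mean())

def two_part_loss(x, x_hat, z_enc):
    # Reconstruction part: encoder-then-decoder should be close to the identity on the data.
    recon = torch.mean(torch.sum((x - x_hat) ** 2, dim=1))
    # Variational part: the encoded batch should match the target latent law mu(dz) = N(0, I).
    z_prior = torch.randn_like(z_enc)
    return recon + mmd_squared(z_enc, z_prior)
</syntaxhighlight>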
[[Category:Bayesian statistics]]
[[Category:Dimension reduction]]
[[Category:2013 in artificial intelligence]]