Variational autoencoder

In [[machine learning]], a '''variational autoencoder (VAE)'''<ref name=":0">{{cite arXiv |last1=Kingma |first1=Diederik P. |last2=Welling |first2=Max |title=Auto-Encoding Variational Bayes |date=2014-05-01 |class=stat.ML |eprint=1312.6114}}</ref> is an [[artificial neural network]] architecture introduced by Diederik P. Kingma and [[Max Welling]], belonging to the families of [[graphical model|probabilistic graphical models]] and [[variational Bayesian methods]].
 
Variational autoencoders are often associated with the [[autoencoder]] model because of their architectural affinity, but they differ significantly in goal and mathematical formulation. Variational autoencoders allow ''statistical inference'' problems (such as inferring the value of one [[random variable]] from another random variable) to be rewritten as ''statistical optimization'' problems (i.e., finding the parameter values that minimize some objective function).<ref>{{cite journal |last1=Kramer |first1=Mark A. |title=Nonlinear principal component analysis using autoassociative neural networks |journal=AIChE Journal |date=1991 |volume=37 |issue=2 |pages=233–243 |doi=10.1002/aic.690370209 |url=https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.690370209 |language=en}}</ref><ref>{{cite journal |last1=Hinton |first1=G. E. |last2=Salakhutdinov |first2=R. R. |title=Reducing the Dimensionality of Data with Neural Networks |journal=Science |date=2006-07-28 |volume=313 |issue=5786 |pages=504–507 |doi=10.1126/science.1127647 |pmid=16873662 |bibcode=2006Sci...313..504H |s2cid=1658773 |url=https://www.science.org/doi/abs/10.1126/science.1127647?casa_token=ZLsQ9vPfFA4AAAAA%3A3iBJRtRFr9RzkbbGpAJQtghIAndmRGEPVxW-yixDgfiXqWuuaQs8WjDMf-fkzTIe8RKn_J9o1aFozD4 |language=en}}</ref><ref>{{cite web |title=A Beginner's Guide to Variational Methods: Mean-Field Approximation |url=https://blog.evjang.com/2016/08/variational-bayes.html |website=Eric Jang |language=en |date=2016-07-08}}</ref> They map the input variable to a multivariate latent distribution. Although this type of model was initially designed for [[unsupervised learning]],<ref>{{cite arXiv |last1=Dilokthanakul |first1=Nat |last2=Mediano |first2=Pedro A. M. |last3=Garnelo |first3=Marta |last4=Lee |first4=Matthew C. H. |last5=Salimbeni |first5=Hugh |last6=Arulkumaran |first6=Kai |last7=Shanahan |first7=Murray |title=Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders |date=2017-01-13 |class=cs.LG |eprint=1611.02648}}</ref><ref>{{cite book |last1=Hsu |first1=Wei-Ning |last2=Zhang |first2=Yu |last3=Glass |first3=James |title=2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |chapter=Unsupervised ___domain adaptation for robust speech recognition via variational autoencoder-based data augmentation |date=December 2017 |pages=16–23 |doi=10.1109/ASRU.2017.8268911 |arxiv=1707.06265 |isbn=978-1-5090-4788-8 |s2cid=22681625 |chapter-url=https://ieeexplore.ieee.org/abstract/document/8268911?casa_token=i8S9DzueB5gAAAAA:SnZUh5mfUYtRpusQLMJxN7eC_-6-qOQs9vpkEcA0Ai_ju-nJH7o1H1DN6nDFdeCY-LgGg3OVKQ}}</ref> its effectiveness has been proven for [[semi-supervised learning]]<ref>{{cite book |last1=Ehsan Abbasnejad |first1=M. 
|last2=Dick |first2=Anthony |last3=van den Hengel |first3=Anton |title=Infinite Variational Autoencoder for Semi-Supervised Learning |date=2017 |pages=5888–5897 |url=https://openaccess.thecvf.com/content_cvpr_2017/html/Abbasnejad_Infinite_Variational_Autoencoder_CVPR_2017_paper.html}}</ref><ref>{{cite journal |last1=Xu |first1=Weidi |last2=Sun |first2=Haoze |last3=Deng |first3=Chao |last4=Tan |first4=Ying |title=Variational Autoencoder for Semi-Supervised Text Classification |journal=Proceedings of the AAAI Conference on Artificial Intelligence |date=2017-02-12 |volume=31 |issue=1 |url=https://ojs.aaai.org/index.php/AAAI/article/view/10966 |language=en}}</ref> and [[supervised learning]].<ref>{{cite journal |last1=Kameoka |first1=Hirokazu |last2=Li |first2=Li |last3=Inoue |first3=Shota |last4=Makino |first4=Shoji |title=Supervised Determined Source Separation with Multichannel Variational Autoencoder |journal=Neural Computation |date=2019-09-01 |volume=31 |issue=9 |pages=1891–1914 |doi=10.1162/neco_a_01217 |pmid=31335290 |s2cid=198168155 |url=https://direct.mit.edu/neco/article/31/9/1891/8494/Supervised-Determined-Source-Separation-with}}</ref>
 
== Architecture ==
Expanding the [[Kullback–Leibler divergence]] between the approximate posterior <math>q_\phi(\cdot|x)</math> and the true posterior <math>p_\theta(\cdot|x)</math>:<math display="block">\begin{align}
D_{KL}(q_\phi({\cdot| x})\parallel p_\theta(\cdot | x)) &= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi({z| x})}{p_\theta(z | x)}\right]\\
&=\mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi({z| x})\, p_\theta(x)}{p_\theta(x, z)}\right] \\
&=\ln p_\theta(x) + \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi({z| x})}{p_\theta(x, z)}\right]
\end{align}</math>
 
 
Now define the [[evidence lower bound]] (ELBO):<math display="block">L_{\theta,\phi}(x) := \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right] = \ln p_\theta(x) - D_{KL}(q_\phi({\cdot| x})\parallel p_\theta({\cdot | x})).</math>
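Since the Kullback–Leibler divergence is nonnegative, this identity can be rearranged into a bound on the log-evidence (a standard observation, stated here to make the purpose of the ELBO explicit):
<math display="block">\ln p_\theta(x) = L_{\theta,\phi}(x) + D_{KL}(q_\phi({\cdot| x})\parallel p_\theta({\cdot | x})) \geq L_{\theta,\phi}(x),</math>
so maximizing <math>L_{\theta,\phi}(x)</math> over <math>\theta</math> and <math>\phi</math> simultaneously pushes up a lower bound on <math>\ln p_\theta(x)</math> and drives the approximate posterior <math>q_\phi(\cdot|x)</math> toward the true posterior <math>p_\theta(\cdot|x)</math>.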
== Reparameterization ==
[[File:Reparameterization Trick.png|thumb|300x300px|Scheme of the reparameterization trick. The random variable <math>{\varepsilon}</math> is injected into the latent space <math>z</math> as external input. In this way, it is possible to backpropagate the gradient without involving the stochastic variable in the update.]]
To efficiently search for <math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>the typical method is [[gradient descent]].
 
It is straightforward to see that<math display="block">\nabla_\theta \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right]
= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[ \nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right] </math>However, in<math display="block">\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right] </math>the gradient <math>\nabla_\phi </math> cannot be moved inside the expectation, since <math>\phi </math> appears in the probability distribution itself. The '''reparameterization trick''' (also known as stochastic backpropagation<ref>{{Cite journal |last1=Rezende |first1=Danilo Jimenez |last2=Mohamed |first2=Shakir |last3=Wierstra |first3=Daan |date=2014-06-18 |title=Stochastic Backpropagation and Approximate Inference in Deep Generative Models |url=https://proceedings.mlr.press/v32/rezende14.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=1278–1286}}</ref>) bypasses this difficulty.<ref name=":0" /><ref>{{Cite journal|last1=Bengio|first1=Yoshua|last2=Courville|first2=Aaron|last3=Vincent|first3=Pascal|title=Representation Learning: A Review and New Perspectives|url=https://ieeexplore.ieee.org/abstract/document/6472238?casa_token=wQPK9gUGfCsAAAAA:FS5uNYCQVJGH-bq-kVvZeTdnQ8a33C6qQ4VUyDyGLMO13QewH3wcry9_Jh-5FATvspBj8YOXfw|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2013|volume=35|issue=8|pages=1798–1828|doi=10.1109/TPAMI.2013.50|pmid=23787338|issn=1939-3539|arxiv=1206.5538|s2cid=393948}}</ref><ref>{{Cite arXiv|last1=Kingma|first1=Diederik P.|last2=Rezende|first2=Danilo J.|last3=Mohamed|first3=Shakir|last4=Welling|first4=Max|date=2014-10-31|title=Semi-Supervised Learning with Deep Generative Models|class=cs.LG|eprint=1406.5298}}</ref>
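In general terms (the map <math>T_\phi</math> and noise density <math>p(\varepsilon)</math> below are generic notation introduced here for illustration, not taken from the cited papers), the trick expresses the latent variable as a deterministic, <math>\phi</math>-differentiable function of <math>x</math> and an auxiliary noise variable whose distribution does not depend on <math>\phi</math>:
<math display="block">z = T_\phi(x, \varepsilon), \quad \varepsilon \sim p(\varepsilon)
\quad\Longrightarrow\quad
\mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi({z| x})}\right]
= \mathbb E_{\varepsilon \sim p} \left[\ln \frac{p_\theta(x, T_\phi(x, \varepsilon))}{q_\phi({T_\phi(x, \varepsilon)| x})}\right],</math>
so that <math>\nabla_\phi</math> can now be moved inside the expectation on the right-hand side and estimated by Monte Carlo sampling of <math>\varepsilon</math>.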
 
 
The most important example is when <math>z \sim q_\phi(\cdot | x) </math> is normally distributed, as <math>\mathcal N(\mu_\phi(x), \Sigma_\phi(x)) </math>.
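The following minimal PyTorch sketch illustrates this Gaussian case with a diagonal covariance (the layer sizes, variable names, and toy data batch are illustrative assumptions, and a standard normal prior over <math>z</math> is assumed, as is common; this is not code from the cited papers). The noise <math>\varepsilon</math> is drawn independently of <math>\phi</math>, so gradients of the one-sample ELBO estimate flow through <math>z</math> back to the encoder parameters:

<syntaxhighlight lang="python">
# Minimal sketch of the Gaussian reparameterization trick (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log of the diagonal of Sigma_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)               # external noise, independent of phi
        z = mu + torch.exp(0.5 * log_var) * eps  # reparameterized sample, differentiable in phi
        x_logits = self.dec(z)
        # Negative one-sample ELBO: reconstruction term plus the closed-form
        # KL divergence between q_phi(z|x) and a standard normal prior.
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return (rec + kl) / x.shape[0]

model = ToyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                          # stand-in data batch in [0, 1]
loss = model(x)                                  # negative ELBO estimate
loss.backward()                                  # gradient reaches mu/log_var through z
opt.step()
</syntaxhighlight>

Had the sample been drawn directly from <math>\mathcal N(\mu_\phi(x), \Sigma_\phi(x))</math> inside the computation graph, the sampling step would have blocked the gradient with respect to <math>\phi</math>; expressing it as <math>\mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon</math> is what makes backpropagation through the sampling step possible.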