Variational autoencoder: Difference between revisions

Revision as of 15:18, 17 April 2025 edit 82.102.110.228 (talk) Stochastic gradient descend has nothing to do with taking expectations. Undid revision 1280088605 by G.S.Ray (talk) Tag: Undo ← Previous edit		Latest revision as of 21:16, 2 August 2025 edit undo TokenByToken (talk \| contribs) Extended confirmed users 1,392 edits category Tag: Visual edit
(2 intermediate revisions by 2 users not shown)
Line 12: Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the [[#Reparameterization\|reparameterization trick]], although the variance of the noise model can be learned separately.{{cn\|date=June 2024}} Although this type of model was initially designed for [[unsupervised learning]],<ref>{{cite arXiv \|last1=Dilokthanakul \|first1=Nat \|last2=Mediano \|first2=Pedro A. M. \|last3=Garnelo \|first3=Marta \|last4=Lee \|first4=Matthew C. H. \|last5=Salimbeni \|first5=Hugh \|last6=Arulkumaran \|first6=Kai \|last7=Shanahan \|first7=Murray \|title=Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders \|date=2017-01-13 \|class=cs.LG \|eprint=1611.02648}}</ref><ref>{{cite book \|last1=Hsu \|first1=Wei-Ning \|last2=Zhang \|first2=Yu \|last3=Glass \|first3=James \|title=2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) \|chapter=Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation \|date=December 2017 \|pages=16–23 \|doi=10.1109/ASRU.2017.8268911 \|arxiv=1707.06265 \|isbn=978-1-5090-4788-8 \|s2cid=22681625 \|chapter-url=https://ieeexplore.ieee.org/document/8268911}}</ref> it has also proven effective for [[semi-supervised learning]]<ref>{{cite book \|last1=Ehsan Abbasnejad \|first1=M. 
\|last2=Dick \|first2=Anthony \|last3=van den Hengel \|first3=Anton \|title=Infinite Variational Autoencoder for Semi-Supervised Learning \|date=2017 \|pages=5888–5897 \|url=https://openaccess.thecvf.com/content_cvpr_2017/html/Abbasnejad_Infinite_Variational_Autoencoder_CVPR_2017_paper.html}}</ref><ref>{{cite journal \|last1=Xu \|first1=Weidi \|last2=Sun \|first2=Haoze \|last3=Deng \|first3=Chao \|last4=Tan \|first4=Ying \|title=Variational Autoencoder for Semi-Supervised Text Classification \|journal=Proceedings of the AAAI Conference on Artificial Intelligence \|date=2017-02-12 \|volume=31 \|issue=1 \|doi=10.1609/aaai.v31i1.10966 \|s2cid=2060721 \|url=https://ojs.aaai.org/index.php/AAAI/article/view/10966 \|language=en\|doi-access=free }}</ref> and [[supervised learning]].<ref>{{cite journal \|last1=Kameoka \|first1=Hirokazu \|last2=Li \|first2=Li \|last3=Inoue \|first3=Shota \|last4=Makino \|first4=Shoji \|title=Supervised Determined Source Separation with Multichannel Variational Autoencoder \|journal=Neural Computation \|date=2019-09-01 \|volume=31 \|issue=9 \|pages=1891–1914 \|doi=10.1162/neco_a_01217 \|pmid=31335290 \|s2cid=198168155 \|url=https://direct.mit.edu/neco/article/31/9/1891/8494/Supervised-Determined-Source-Separation-with\|url-access=subscription }}</ref> == Overview of architecture and operation == Line 69: = \ln p_\theta(x) - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta({\cdot \| x})) </math>Maximizing the ELBO<math display="block">\theta^*,\phi^* = \underset{\theta,\phi}\operatorname{arg max} \, L_{\theta,\phi}(x) </math>is equivalent to simultaneously maximizing <math>\ln p_\theta(x) </math> and minimizing <math> D_{KL}(q_\phi({z\| x})\parallel p_\theta({z\| x})) </math>. That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior <math>q_\phi(\cdot \| x) </math> from the exact posterior <math>p_\theta(\cdot \| x) </math>. 
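As a numerical illustration of this decomposition, the following minimal Python sketch uses a hypothetical one-dimensional linear-Gaussian model, chosen here only because its evidence and exact posterior have closed forms; it is not taken from the cited literature. The model, the weight <code>a</code>, and all function names are assumptions made for the example. It evaluates the ELBO as the log-evidence minus the divergence from the exact posterior, and checks that the gap to the log-evidence is positive for a poor approximate posterior and vanishes when the approximate posterior equals the exact one.

```python
import math

# Assumed toy model (not from the article): z ~ N(0, 1), x | z ~ N(a*z, 1).
# Then x ~ N(0, a^2 + 1) and p(z | x) = N(a*x / (a^2 + 1), 1 / (a^2 + 1)).

def log_evidence(x, a):
    """ln p(x) under the toy model."""
    var = a * a + 1.0
    return -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)

def exact_posterior(x, a):
    """Mean and standard deviation of the exact posterior p(z | x)."""
    var = 1.0 / (a * a + 1.0)
    return a * x * var, math.sqrt(var)

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL(N(m_q, s_q^2) || N(m_p, s_p^2)) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2.0 * s_p**2) - 0.5

def elbo(x, a, m_q, s_q):
    """ELBO written as ln p(x) - KL(q(. | x) || p(. | x))."""
    m_p, s_p = exact_posterior(x, a)
    return log_evidence(x, a) - kl_gauss(m_q, s_q, m_p, s_p)

x, a = 1.3, 0.8
m_star, s_star = exact_posterior(x, a)

# Gap between the log-evidence and the ELBO for two choices of q.
gap_bad = log_evidence(x, a) - elbo(x, a, m_q=0.0, s_q=1.0)
gap_opt = log_evidence(x, a) - elbo(x, a, m_q=m_star, s_q=s_star)
print(gap_bad, gap_opt)
```

In this sketch the gap is exactly the divergence term, so for fixed model parameters, improving the approximate posterior tightens the bound without changing the log-likelihood itself.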
The form given is not very convenient for maximization, but the following equivalent form is:<math display="block">L_{\theta,\phi}(x) = \mathbb E_{z \sim q_\phi(\cdot \| x)} \left[\ln p_\theta(x\|z)\right] - D_{KL}(q_\phi({\cdot\| x})\parallel p_\theta(\cdot)) </math>where <math>\ln p_\theta(x\|z)</math> is implemented as <math>-\frac{1}{2}\\| x - D_\theta(z)\\|^2_2</math>, since that is, up to an additive constant, what <math>x\|z \sim \mathcal N(D_\theta(z), I)</math> yields. That is, we model the distribution of <math>x</math> conditional on <math>z</math> to be a Gaussian distribution centered on <math>D_\theta(z)</math>. The distributions <math>q_\phi(z \|x)</math> and <math>p_\theta(z)</math> are often also chosen to be Gaussian, with <math>z\|x \sim \mathcal N(E_\phi(x), \sigma_\phi(x)^2I)</math> and <math>z \sim \mathcal N(0, I)</math>. The formula for the [[Kullback–Leibler divergence#Multivariate normal distributions\|KL divergence of Gaussians]] then gives:<math display="block">L_{\theta,\phi}(x) = -\frac 12\mathbb E_{z \sim q_\phi(\cdot \| x)} \left[ \\|x - D_\theta(z)\\|_2^2\right] - \frac 12 \left( N\sigma_\phi(x)^2 + \\|E_\phi(x)\\|_2^2 - 2N\ln\sigma_\phi(x) \right) + Const </math>Here <math> N </math> is the dimension of <math> z </math>. For a more detailed derivation and more interpretations of ELBO and its maximization, see [[Evidence lower bound\|its main page]]. == Reparameterization == Line 97: Some architectures directly address the quality of the generated samples<ref>{{Cite arXiv\|last1=Dai\|first1=Bin\|last2=Wipf\|first2=David\|date=2019-10-30\|title=Diagnosing and Enhancing VAE Models\|class=cs.LG\|eprint=1903.05789}}</ref><ref>{{Cite arXiv\|last1=Dorta\|first1=Garoe\|last2=Vicente\|first2=Sara\|last3=Agapito\|first3=Lourdes\|last4=Campbell\|first4=Neill D. 
F.\|last5=Simpson\|first5=Ivor\|date=2018-07-31\|title=Training VAEs Under Structured Residuals\|class=stat.ML\|eprint=1804.01050}}</ref> or use more than one latent space to further improve representation learning. Some architectures combine VAEs with [[generative adversarial network]]s to obtain hybrid models.<ref>{{Cite journal\|last1=Larsen\|first1=Anders Boesen Lindbo\|last2=Sønderby\|first2=Søren Kaae\|last3=Larochelle\|first3=Hugo\|last4=Winther\|first4=Ole\|date=2016-06-11\|title=Autoencoding beyond pixels using a learned similarity metric\|url=http://proceedings.mlr.press/v48/larsen16.html\|journal=International Conference on Machine Learning\|language=en\|publisher=PMLR\|pages=1558–1566\|arxiv=1512.09300}}</ref><ref>{{cite arXiv\|last1=Bao\|first1=Jianmin\|last2=Chen\|first2=Dong\|last3=Wen\|first3=Fang\|last4=Li\|first4=Houqiang\|last5=Hua\|first5=Gang\|date=2017\|title=CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training\|pages=2745–2754\|class=cs.CV\|eprint=1703.10155}}</ref><ref>{{Cite journal\|last1=Gao\|first1=Rui\|last2=Hou\|first2=Xingsong\|last3=Qin\|first3=Jie\|last4=Chen\|first4=Jiaxin\|last5=Liu\|first5=Li\|last6=Zhu\|first6=Fan\|last7=Zhang\|first7=Zhao\|last8=Shao\|first8=Ling\|date=2020\|title=Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning\|url=https://ieeexplore.ieee.org/document/8957359\|journal=IEEE Transactions on Image Processing\|volume=29\|pages=3665–3680\|doi=10.1109/TIP.2020.2964429\|pmid=31940538\|bibcode=2020ITIP...29.3665G\|s2cid=210334032\|issn=1941-0042\|url-access=subscription}}</ref> It is not necessary to use gradients to update the encoder. In fact, the encoder is not necessary for the generative model.<ref>{{cite book \| last1=Drefs \| first1=J. \| last2=Guiraud \| first2=E. \| last3=Panagiotou \| first3=F. \| last4=Lücke \| first4=J. 
\| chapter=Direct evolutionary optimization of variational autoencoders with binary latents \| title=Joint European Conference on Machine Learning and Knowledge Discovery in Databases \| series=Lecture Notes in Computer Science \| pages=357–372 \| year=2023 \| volume=13715 \| publisher=Springer Nature Switzerland \| doi=10.1007/978-3-031-26409-2_22 \| isbn=978-3-031-26408-5 \| chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-26409-2_22 }}</ref> Line 146: [[Category:Bayesian statistics]] [[Category:Dimension reduction]] [[Category:2013 in artificial intelligence]]