Variational autoencoder


In machine learning, a variational autoencoder (VAE)[1] is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.

Variational autoencoders are often associated with the autoencoder model because of their architectural affinity, but they differ significantly in goal and mathematical formulation. Variational autoencoders allow statistical inference problems (such as inferring the value of one random variable from another random variable) to be rewritten as statistical optimization problems (i.e., finding the parameter values that minimize some objective function).[2][3][4] They are meant to map the input variable to a multivariate latent distribution. Although this type of model was initially designed for unsupervised learning,[5][6] its effectiveness has been proven for semi-supervised learning[7][8] and supervised learning.[9]

Architecture

In a VAE, the input data is sampled from a parametrized distribution (the prior, in Bayesian inference terms), and the encoder and decoder are trained jointly such that the output minimizes a reconstruction error in the sense of the Kullback–Leibler divergence between the parametric posterior and the true posterior.[10][11][12]

Formulation

 
The basic scheme of a variational autoencoder. The model receives $x$ as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces $x'$ as similar as possible to $x$.

From a formal perspective, given an input dataset $x$ characterized by an unknown probability distribution $P(x)$, the objective is to model or approximate the data's true distribution $P(x)$ using a parametrized distribution $p_\theta$ having parameters $\theta$. Let $z$ be a random vector jointly-distributed with $x$. Conceptually, $z$ will represent a latent encoding of $x$. Marginalizing over $z$ gives

$p_\theta(x) = \int_z p_\theta(x, z) \, dz,$

where $p_\theta(x, z)$ represents the joint distribution under $p_\theta$ of the observable data $x$ and its latent representation or encoding $z$. According to the chain rule, the equation can be rewritten as

$p_\theta(x) = \int_z p_\theta(x \mid z) \, p_\theta(z) \, dz.$

In the vanilla variational autoencoder, $z$ is usually taken to be a finite-dimensional vector of real numbers, and $p_\theta(x \mid z)$ to be a Gaussian distribution. Then $p_\theta(x)$ is a mixture of Gaussian distributions.
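
For example, if the decoder's Gaussian has mean and covariance produced by a neural network with parameters $\theta$ (written here as $\mu_\theta(z)$ and $\Sigma_\theta(z)$, a notation introduced only for this illustration), the marginal distribution is a continuous mixture of Gaussians indexed by the latent variable:

$p_\theta(x) = \int_z \mathcal{N}\!\left(x;\, \mu_\theta(z), \Sigma_\theta(z)\right) p_\theta(z) \, dz.$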

It is now possible to define the set of the relationships between the input data and its latent representation as

  • Prior $p_\theta(z)$
  • Likelihood $p_\theta(x \mid z)$
  • Posterior $p_\theta(z \mid x)$

Unfortunately, the computation of $p_\theta(z \mid x)$ is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce a further function that approximates the posterior distribution:

$q_\phi(z \mid x) \approx p_\theta(z \mid x)$

with $\phi$ defined as the set of real values that parametrize $q$. This is sometimes called amortized inference, since by "investing" in finding a good $q_\phi$, one can later infer $z$ from $x$ quickly without doing any integrals.

In this way, the problem becomes that of finding a good probabilistic autoencoder, in which the conditional likelihood distribution $p_\theta(x \mid z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_\phi(z \mid x)$ is computed by the probabilistic encoder.
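
As an illustration, the probabilistic encoder and decoder can be realized as small neural networks: the encoder outputs the mean and log-variance of a diagonal Gaussian $q_\phi(z \mid x)$, and the decoder maps a latent sample $z$ to the parameters of $p_\theta(x \mid z)$. The following is a minimal sketch in PyTorch; the library choice, class names, and layer sizes are assumptions made for illustration, not details from the article or the cited papers.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Probabilistic encoder: maps x to the parameters (mean, log-variance)
    of a diagonal Gaussian approximate posterior q_phi(z|x)."""
    def __init__(self, input_dim: int, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Probabilistic decoder: maps a latent sample z to the mean of
    the conditional likelihood p_theta(x|z)."""
    def __init__(self, latent_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid(),  # outputs in (0, 1)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```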

Evidence lower bound (ELBO)

As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and $\phi$ to make $q_\phi(z \mid x)$ as close as possible to $p_\theta(z \mid x)$. As reconstruction loss, mean squared error and cross entropy are often used.

As distance loss between the two distributions, the reverse Kullback–Leibler divergence $D_{KL}(q_\phi(z \mid x) \parallel p_\theta(z \mid x))$ is a good choice to squeeze $q_\phi(z \mid x)$ under $p_\theta(z \mid x)$.[1][13]

The distance loss just defined is expanded as

$D_{KL}(q_\phi(z \mid x) \parallel p_\theta(z \mid x)) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$
$= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{q_\phi(z \mid x)\, p_\theta(x)}{p_\theta(x, z)}\right]$
$= \ln p_\theta(x) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{q_\phi(z \mid x)}{p_\theta(x, z)}\right].$

Now define the evidence lower bound (ELBO):

$L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot \mid x) \parallel p_\theta(\cdot \mid x)).$

Maximizing the ELBO, $\theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)$, is equivalent to simultaneously maximizing $\ln p_\theta(x)$ and minimizing $D_{KL}(q_\phi(z \mid x) \parallel p_\theta(z \mid x))$. That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior $q_\phi(\cdot \mid x)$ from the exact posterior $p_\theta(\cdot \mid x)$.

For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page.
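
As an illustration, when the prior is $\mathcal{N}(0, I)$ and the encoder outputs a diagonal Gaussian, the Kullback–Leibler term has a closed form, and the negative ELBO can be computed as a reconstruction loss plus a KL penalty. The sketch below is a hypothetical helper in the same PyTorch setting as the snippet above; binary cross entropy is used as the reconstruction loss here, and mean squared error would work analogously.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_hat, mu, logvar):
    """Negative ELBO = reconstruction loss + KL(q_phi(z|x) || N(0, I)).

    Uses the closed-form KL divergence between a diagonal Gaussian
    N(mu, diag(exp(logvar))) and the standard normal prior.
    """
    # Reconstruction term: binary cross entropy, summed over features
    # (assumes x and x_hat lie in [0, 1], e.g. after a sigmoid output).
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL term: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar), per the closed form.
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu.pow(2) - 1.0 - logvar)
    return recon + kl
```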

Reparameterization

 
The scheme of the reparameterization trick. The randomness variable $\varepsilon$ is injected into the latent space $z$ as external input. In this way, it is possible to backpropagate the gradient without involving stochastic variables during the update.

To efficiently search for $\theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)$, the typical method is gradient descent.

It is straightforward to find

$\nabla_\theta \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$

However,

$\nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$

does not allow one to put the $\nabla_\phi$ inside the expectation, since $\phi$ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation[14]) bypasses this difficulty.[1][15][16]


The most important example is when $z \sim q_\phi(\cdot \mid x)$ is normally distributed, as $\mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$.

 
The scheme of a variational autoencoder after the reparameterization trick.

This can be reparametrized by letting $\varepsilon \sim \mathcal{N}(0, I)$ be a "standard random number generator", and constructing $z$ as $z = \mu_\phi(x) + L_\phi(x)\varepsilon$. Here, $L_\phi(x)$ is obtained by the Cholesky decomposition:

$\Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T.$

Then we have

$\nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \ln \frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\varepsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\varepsilon \mid x)}\right],$

and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.

Since we reparametrized $z$, we need to find $q_\phi(z \mid x)$. Let $q_0$ be the probability density function for $\varepsilon$; then

$\ln q_\phi(z \mid x) = \ln q_0(\varepsilon) - \ln \left|\det\left(\frac{\partial z}{\partial \varepsilon}\right)\right|,$

where $\frac{\partial z}{\partial \varepsilon}$ is the Jacobian matrix of $z$ with respect to $\varepsilon$. Since $z = \mu_\phi(x) + L_\phi(x)\varepsilon$, this is

$\ln q_\phi(z \mid x) = -\frac{1}{2} \|\varepsilon\|^2 - \ln |\det L_\phi(x)| - \frac{n}{2} \ln(2\pi).$
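
In the common case of a diagonal covariance, $L_\phi(x)$ is simply a diagonal matrix of standard deviations, so the trick reduces to $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$. The following minimal, self-contained sketch (again in PyTorch, with illustrative variable names) shows that gradients with respect to $\mu$ and $\log \sigma^2$, and hence with respect to $\phi$, are well defined because the randomness enters only through $\varepsilon$.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, diag(exp(logvar))) as a deterministic, differentiable
    function of (mu, logvar) plus external standard-normal noise epsilon."""
    std = torch.exp(0.5 * logvar)   # sigma_phi(x); diagonal case of L_phi(x)
    eps = torch.randn_like(std)     # epsilon ~ N(0, I), independent of phi
    return mu + std * eps           # z = mu_phi(x) + sigma_phi(x) * eps

# Gradients flow through mu and logvar (i.e. to phi), not through eps:
mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad)       # tensor of ones: dz/dmu = I
print(logvar.grad)   # equals 0.5 * std * eps, i.e. depends on the sampled noise
```

In a full training loop, the sampled $z$ would be fed to the decoder and the negative ELBO minimized by stochastic gradient descent over both $\theta$ and $\phi$.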

Variations

Many applications and extensions of variational autoencoders have been used to adapt the architecture to other domains and to improve its performance.

β-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for $\beta$ values greater than one. This architecture can discover disentangled latent factors without supervision.[17][18]
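
In the notation of the ELBO above, the β-VAE objective reweights the Kullback–Leibler term by the factor $\beta$ (with $\beta = 1$ recovering the standard VAE); a common way to write it is

$\mathcal{L}_{\beta}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x \mid z)\right] - \beta \, D_{KL}(q_\phi(z \mid x) \parallel p_\theta(z)).$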

The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.[19]
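
One simple way to realize this conditioning, sketched below as a hypothetical helper in the same PyTorch setting as the earlier snippets (concatenating a one-hot label to the inputs is a common construction used here for illustration, not necessarily the exact mechanism of the cited work), is to give both the encoder and the decoder access to the label $y$, so that the model learns $q_\phi(z \mid x, y)$ and $p_\theta(x \mid z, y)$.

```python
import torch

def condition_on_label(x: torch.Tensor, y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Concatenate a one-hot encoding of the label y to the input x, so that
    the encoder and decoder both receive the conditioning information."""
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    return torch.cat([x, y_onehot], dim=-1)

# Example: a batch of 8 flattened 28x28 images with class labels 0..9.
x = torch.rand(8, 784)
y = torch.randint(0, 10, (8,))
x_cond = condition_on_label(x, y, num_classes=10)
print(x_cond.shape)  # torch.Size([8, 794])
```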

Some structures directly deal with the quality of the generated samples[20][21] or implement more than one latent space to further improve the representation learning.[22][23]

Some architectures mix VAE and generative adversarial networks to obtain hybrid models.[24][25][26]

References

  1. ^ a b c Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  2. ^ Kramer, Mark A. (1991). "Nonlinear principal component analysis using autoassociative neural networks". AIChE Journal. 37 (2): 233–243. doi:10.1002/aic.690370209.
  3. ^ Hinton, G. E.; Salakhutdinov, R. R. (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662. S2CID 1658773.
  4. ^ "A Beginner's Guide to Variational Methods: Mean-Field Approximation". Eric Jang. 2016-07-08.
  5. ^ Dilokthanakul, Nat; Mediano, Pedro A. M.; Garnelo, Marta; Lee, Matthew C. H.; Salimbeni, Hugh; Arulkumaran, Kai; Shanahan, Murray (2017-01-13). "Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders". arXiv:1611.02648 [cs.LG].
  6. ^ Hsu, Wei-Ning; Zhang, Yu; Glass, James (December 2017). "Unsupervised ___domain adaptation for robust speech recognition via variational autoencoder-based data augmentation". 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 16–23. arXiv:1707.06265. doi:10.1109/ASRU.2017.8268911. ISBN 978-1-5090-4788-8. S2CID 22681625.
  7. ^ Ehsan Abbasnejad, M.; Dick, Anthony; van den Hengel, Anton (2017). Infinite Variational Autoencoder for Semi-Supervised Learning. pp. 5888–5897.
  8. ^ Xu, Weidi; Sun, Haoze; Deng, Chao; Tan, Ying (2017-02-12). "Variational Autoencoder for Semi-Supervised Text Classification". Proceedings of the AAAI Conference on Artificial Intelligence. 31 (1).
  9. ^ Kameoka, Hirokazu; Li, Li; Inoue, Shota; Makino, Shoji (2019-09-01). "Supervised Determined Source Separation with Multichannel Variational Autoencoder". Neural Computation. 31 (9): 1891–1914. doi:10.1162/neco_a_01217. PMID 31335290. S2CID 198168155.
  10. ^ An, J.; Cho, S. (2015). "Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability". Special Lecture on IE. 2 (1).
  11. ^ Khobahi, S.; Soltanalian, M. (2019). "Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding". arXiv:1911.12410 [eess.SP].
  12. ^ Kingma, Diederik P.; Welling, Max (2019). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4): 307–392. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.
  13. ^ "From Autoencoder to Beta-VAE". Lil'Log. 2018-08-12.
  14. ^ Rezende, Danilo Jimenez; Mohamed, Shakir; Wierstra, Daan (2014-06-18). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models". International Conference on Machine Learning. PMLR: 1278–1286.
  15. ^ Bengio, Yoshua; Courville, Aaron; Vincent, Pascal (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828. arXiv:1206.5538. doi:10.1109/TPAMI.2013.50. ISSN 1939-3539. PMID 23787338. S2CID 393948.
  16. ^ Kingma, Diederik P.; Rezende, Danilo J.; Mohamed, Shakir; Welling, Max (2014-10-31). "Semi-Supervised Learning with Deep Generative Models". arXiv:1406.5298 [cs.LG].
  17. ^ Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick, Matthew; Mohamed, Shakir; Lerchner, Alexander (2016-11-04). "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework".
  18. ^ Burgess, Christopher P.; Higgins, Irina; Pal, Arka; Matthey, Loic; Watters, Nick; Desjardins, Guillaume; Lerchner, Alexander (2018-04-10). "Understanding disentangling in β-VAE". arXiv:1804.03599 [stat.ML].
  19. ^ Sohn, Kihyuk; Lee, Honglak; Yan, Xinchen (2015-01-01). "Learning Structured Output Representation using Deep Conditional Generative Models" (PDF).
  20. ^ Dai, Bin; Wipf, David (2019-10-30). "Diagnosing and Enhancing VAE Models". arXiv:1903.05789 [cs.LG].
  21. ^ Dorta, Garoe; Vicente, Sara; Agapito, Lourdes; Campbell, Neill D. F.; Simpson, Ivor (2018-07-31). "Training VAEs Under Structured Residuals". arXiv:1804.01050 [stat.ML].
  22. ^ Tomczak, Jakub; Welling, Max (2018-03-31). "VAE with a VampPrior". International Conference on Artificial Intelligence and Statistics. PMLR: 1214–1223. arXiv:1705.07120.
  23. ^ Razavi, Ali; Oord, Aaron van den; Vinyals, Oriol (2019-06-02). "Generating Diverse High-Fidelity Images with VQ-VAE-2". arXiv:1906.00446 [cs.LG].
  24. ^ Larsen, Anders Boesen Lindbo; Sønderby, Søren Kaae; Larochelle, Hugo; Winther, Ole (2016-06-11). "Autoencoding beyond pixels using a learned similarity metric". International Conference on Machine Learning. PMLR: 1558–1566. arXiv:1512.09300.
  25. ^ Bao, Jianmin; Chen, Dong; Wen, Fang; Li, Houqiang; Hua, Gang (2017). "CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training". pp. 2745–2754. arXiv:1703.10155 [cs.CV].
  26. ^ Gao, Rui; Hou, Xingsong; Qin, Jie; Chen, Jiaxin; Liu, Li; Zhu, Fan; Zhang, Zhao; Shao, Ling (2020). "Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning". IEEE Transactions on Image Processing. 29: 3665–3680. Bibcode:2020ITIP...29.3665G. doi:10.1109/TIP.2020.2964429. ISSN 1941-0042. PMID 31940538. S2CID 210334032.