== Overview of architecture and operation ==
A variational autoencoder is a generative model with a prior and noise distribution. Usually such models are trained using the [[Expectation–maximization algorithm|expectation-maximization]] meta-algorithm (e.g. [[Principal_component_analysis|probabilistic PCA]], (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, since the likelihood itself is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational [[Posterior_probability|posteriors]]. These q-distributions are normally parameterized for each individual data point in a separate optimization process. Variational autoencoders instead use a neural network, the encoder, to amortize this inference: the encoder maps from the input space to the parameters of the q-distribution, and a single set of network weights is jointly optimized across all data points.
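In the standard formulation, this bound is the [[evidence lower bound]] (ELBO). For a single data point <math>x</math>, with variational posterior <math>q_\phi(z\mid x)</math>, prior <math>p_\theta(z)</math> and noise distribution <math>p_\theta(x\mid z)</math>, it can be written as
<math display="block">L_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln p_\theta(x\mid z)\right] - D_{KL}\left(q_\phi(\cdot\mid x)\,\|\,p_\theta(\cdot)\right) \le \ln p_\theta(x),</math>
so that maximizing the ELBO both tightens the bound and increases the likelihood assigned to the data.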
The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. by producing the means of the noise distribution. Another neural network can be used to map to the variance, but this is often omitted for simplicity; in that case, the variance can be treated as a free parameter optimized by gradient descent.
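The following sketch illustrates such a decoder under the simplified setup just described: it outputs the mean of a Gaussian noise distribution, with a single shared log-variance kept as a free parameter. The class and parameter names, and the use of PyTorch, are illustrative rather than part of any standard implementation.
<syntaxhighlight lang="python">
import torch
from torch import nn

class Decoder(nn.Module):
    """Illustrative decoder: maps a latent vector z to the mean of a
    Gaussian noise distribution over the input space."""
    def __init__(self, latent_dim, data_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, data_dim),
        )
        # A single shared log-variance, optimized by gradient descent
        # instead of being predicted by a second neural network.
        self.log_var = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        mean = self.net(z)
        return mean, self.log_var
</syntaxhighlight>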
To optimize this model, one needs to compute two terms: the "reconstruction error", and the [[Kullback–Leibler divergence]] (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as the p-distribution. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian noise, whereas a task such as binarized MNIST requires Bernoulli noise. The KL-D term from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.<ref name="Kingma2013">{{cite arXiv |last1=Kingma |first1=Diederik P. |last2=Welling |first2=Max |title=Auto-Encoding Variational Bayes |date=2013-12-20 |class=stat.ML |eprint=1312.6114}}</ref>
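As a sketch of how these two terms are typically computed in practice, the function below evaluates a one-sample negative ELBO for a Bernoulli noise distribution (as in binarized MNIST), with a Gaussian q-distribution and a standard-normal prior so that the KL-D is available in closed form. The function and argument names (<code>elbo_loss</code>, <code>encoder</code>, <code>decoder</code>) are hypothetical, and PyTorch is assumed only for illustration.
<syntaxhighlight lang="python">
import torch
from torch.nn import functional as F

def elbo_loss(x, encoder, decoder):
    """Negative ELBO for a mini-batch: reconstruction error plus KL divergence.

    Assumes encoder(x) returns the mean and log-variance of a diagonal
    Gaussian q(z|x), the prior p(z) is standard normal, and decoder(z)
    returns Bernoulli logits over the input space.
    """
    mu, log_var = encoder(x)

    # Reparameterized sample from q(z|x), so gradients flow through mu and log_var.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # Reconstruction term: expected log-likelihood, approximated with one sample.
    logits = decoder(z)
    reconstruction = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")

    # KL divergence between the Gaussian q(z|x) and the standard-normal prior
    # (closed form for diagonal Gaussians).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return (reconstruction + kl) / x.shape[0]
</syntaxhighlight>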
More recent approaches replace the [[Kullback–Leibler divergence]] (KL-D) with [[Statistical distance|various statistical distances]]; see the section [[#Statistical distance VAE variants|"Statistical distance VAE variants"]] below.