Unfortunately, the computation of <math>p_\theta(\mathbf{x})</math> is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce an additional function that approximates the posterior distribution as
:<math>q_\phi(\mathbf{z}\mid \mathbf{x}) \approx p_\theta(\mathbf{z}\mid \mathbf{x})</math>
with <math>\phi</math> defined as the set of real values that parametrize <math>q</math>.
In this way, the overall problem is translated into the autoencoder ___domain, in which the conditional likelihood distribution <math>p_\theta(\mathbf{x}\mid\mathbf{z})</math> is carried by the ''probabilistic decoder'', while the approximated posterior distribution <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> is computed by the ''probabilistic encoder''.
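For illustration, the sketch below shows one possible realization of the probabilistic encoder and decoder as two small feed-forward networks in [[PyTorch]]; the layer sizes, the <code>Encoder</code>/<code>Decoder</code> names and the choice of a diagonal Gaussian for <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> are assumptions made for the example, not part of the general formulation.
<syntaxhighlight lang="python">
import torch
from torch import nn

class Encoder(nn.Module):
    """Probabilistic encoder: maps x to the parameters of q_phi(z|x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z|x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Probabilistic decoder: maps z to the parameters of p_theta(x|z)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return torch.sigmoid(self.out(h))  # mean of p_theta(x|z), e.g. pixel intensities
</syntaxhighlight>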
== ELBO loss function ==
As in every [[deep learning]] problem, it is necessary to define a differentiable loss function in order to update the network weights through [[backpropagation]].
For variational autoencoders, the idea is to jointly optimize the generative model parameters <math>\theta</math>, to reduce the reconstruction error between the input and the output, and the variational parameters <math>\phi</math>, to make <math>q_\phi(\mathbf{z}\mid \mathbf{x})</math> as close as possible to the true posterior <math>p_\theta(\mathbf{z}\mid \mathbf{x})</math>.
As reconstruction loss, [[mean squared error]] and [[cross entropy]] are often used.
As distance loss between the two distributions, the reverse Kullback–Leibler divergence <math>D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}\mid \mathbf{x}))</math> is a good choice, since minimizing it squeezes <math>q_\phi(\mathbf{z}\mid \mathbf{x})</math> under <math>p_\theta(\mathbf{z}\mid \mathbf{x})</math>.
The distance loss just defined is expanded as
: <math>\begin{align}
D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}\mid \mathbf{x})) &= \int q_\phi(\mathbf{z}\mid \mathbf{x}) \log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})}{p_\theta(\mathbf{z}\mid \mathbf{x})} \, d\mathbf{z}\\
&= \int q_\phi(\mathbf{z}\mid \mathbf{x}) \log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})\,p_\theta(\mathbf{x})}{p_\theta(\mathbf{x}, \mathbf{z})} \, d\mathbf{z}\\
&= \int q_\phi(\mathbf{z}\mid \mathbf{x}) \left(\log (p_\theta(\mathbf{x})) + \log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})}{p_\theta(\mathbf{x}, \mathbf{z})}\right) \, d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z}\mid \mathbf{x}) \log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})}{p_\theta(\mathbf{x}, \mathbf{z})} \, d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z}\mid \mathbf{x}) \log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})}{p_\theta(\mathbf{x}\mid \mathbf{z})\,p_\theta(\mathbf{z})} \, d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + E_{\mathbf{z} \sim q_\phi(\mathbf{z}\mid \mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z}\mid \mathbf{x})}{p_\theta(\mathbf{z})} - \log (p_\theta(\mathbf{x}\mid \mathbf{z}))\right]\\
&= \log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z})) - E_{\mathbf{z} \sim q_\phi(\mathbf{z}\mid \mathbf{x})}\left[\log (p_\theta(\mathbf{x}\mid \mathbf{z}))\right]
\end{align}</math>
At this point, it is possible to rewrite the equation as
: <math>\log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}\mid \mathbf{x})) = E_{\mathbf{z} \sim q_\phi(\mathbf{z}\mid \mathbf{x})}\left[\log (p_\theta(\mathbf{x}\mid \mathbf{z}))\right] - D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}))</math>
The goal is to maximize the [[Left hand side|left-hand side]] of the equation: maximizing the [[log-likelihood]] <math>\log (p_\theta(\mathbf{x}))</math> improves the quality of the generated data, while minimizing the distance between the real posterior and the estimated one.
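The rearranged identity can be checked numerically on a toy model in which every term has a closed form. The sketch below is a minimal illustration, assuming a one-dimensional linear-Gaussian model chosen only for this check, and evaluates both sides exactly.
<syntaxhighlight lang="python">
import numpy as np

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1).
# Then p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
x = 1.3                      # observed data point
m, s2 = 0.7, 0.4             # arbitrary approximate posterior q(z|x) = N(m, s2)

def kl_gauss(m1, v1, m2, v2):
    """KL divergence between univariate Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0   # log p(x)
kl_q_posterior = kl_gauss(m, s2, x / 2.0, 0.5)           # D_KL(q || p(z|x))
kl_q_prior = kl_gauss(m, s2, 0.0, 1.0)                   # D_KL(q || p(z))
# E_{z~q}[log p(x|z)] in closed form, since E_q[(x - z)^2] = (x - m)^2 + s2
exp_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)

lhs = log_px - kl_q_posterior
rhs = exp_log_lik - kl_q_prior
assert np.isclose(lhs, rhs)   # both sides agree up to floating-point error
</syntaxhighlight>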
The loss function obtained in this way, also named the [[evidence lower bound]] (ELBO) loss function, can be written as
: <math>L_{\theta,\phi} = -\log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}\mid \mathbf{x})) = -E_{\mathbf{z} \sim q_\phi(\mathbf{z}\mid \mathbf{x})}\left[\log (p_\theta(\mathbf{x}\mid \mathbf{z}))\right] + D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}))</math>
Since the Kullback–Leibler divergence is non-negative, it follows that
: <math>-L_{\theta,\phi} = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z}\mid \mathbf{x})\parallel p_\theta(\mathbf{z}\mid \mathbf{x})) \leq \log (p_\theta(\mathbf{x}))</math>
The optimal parameters minimize this loss function. The problem can be summarized as
: <math>\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{arg\,min}} \, L_{\theta,\phi}</math>
The main advantage of this formulation is that it makes it possible to jointly optimize the loss with respect to the parameters <math>\theta</math> and <math>\phi</math>.
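In practice, with a diagonal Gaussian encoder and a standard normal prior <math>p_\theta(\mathbf{z}) = \mathcal{N}(0, \boldsymbol{I})</math>, the Kullback–Leibler term has a closed form and the expected log-likelihood is estimated through a reconstruction loss. The sketch below shows this commonly used form of the loss in PyTorch; the use of [[mean squared error]] as reconstruction term and of a standard normal prior are assumptions made for the example.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def elbo_loss(x, x_recon, mu, log_var):
    """Negative ELBO for a diagonal-Gaussian encoder and a standard normal prior.

    x        -- input batch
    x_recon  -- decoder output (reconstruction of x)
    mu       -- mean of q_phi(z|x), shape (batch, z_dim)
    log_var  -- log-variance of q_phi(z|x), shape (batch, z_dim)
    """
    # Reconstruction term: here mean squared error, summed over the batch
    recon = F.mse_loss(x_recon, x, reduction='sum')
    # Closed-form D_KL(N(mu, sigma^2 I) || N(0, I))
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
</syntaxhighlight>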
Before the ELBO loss function can be applied to an optimization problem and the gradient backpropagated, it must be made differentiable: the so-called '''reparameterization trick''' removes the stochastic sampling from the inner computation, so that the resulting formulation is differentiable.
== Reparameterization trick ==
The main assumption about the latent space is that it can be modeled as a set of multivariate Gaussian distributions, and thus it can be described as
: <math>\mathbf{z} \sim q_\phi(\mathbf{z}\mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)</math>
Given <math>\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \boldsymbol{I})</math> and <math>\odot</math> defined as the element-wise product ([[Hadamard product (matrices)|Hadamard product]]), the reparameterization trick modifies the above equation as
: <math>\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}</math>
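A minimal sketch of this step, assuming the diagonal-Gaussian encoder above provides <code>mu</code> and <code>log_var</code>: the gradient can flow through both outputs because the only source of randomness is the externally drawn noise <code>eps</code>.
<syntaxhighlight lang="python">
import torch

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, sigma^2 I) as a differentiable function of mu and log_var."""
    std = torch.exp(0.5 * log_var)   # sigma
    eps = torch.randn_like(std)      # epsilon ~ N(0, I), sampled outside the graph
    return mu + std * eps            # element-wise product: z = mu + sigma * epsilon
</syntaxhighlight>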