Variational autoencoder

Unfortunately, the computation of <math>p_\theta(\mathbf{x})</math> is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce an additional function that approximates the posterior distribution as
 
:<math>q_\phi(\mathbf{z\mid x}) \approx p_\theta(\mathbf{z\mid x})</math>
 
with <math>\phi</math> defined as the set of real values that parametrize <math>q</math>.
 
In this way, the overall problem translates naturally into the autoencoder ___domain, in which the conditional likelihood distribution <math>p_\theta(\mathbf{x}\mid\mathbf{z})</math> is carried by the ''probabilistic decoder'', while the approximated posterior distribution <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> is computed by the ''probabilistic encoder''.
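As an illustration, the probabilistic encoder and decoder can each be realised as a small neural network. The following sketch (a non-authoritative example assuming [[PyTorch]], a Gaussian encoder, and illustrative layer sizes) shows an encoder that outputs the parameters of <math>q_\phi(\mathbf{z}\mid\mathbf{x})</math> and a decoder that outputs the parameters of <math>p_\theta(\mathbf{x}\mid\mathbf{z})</math>:

<syntaxhighlight lang="python">
import torch
from torch import nn

class Encoder(nn.Module):
    """Probabilistic encoder: maps x to the parameters of q_phi(z|x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log of the variance of q_phi(z|x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Probabilistic decoder: maps z to the parameters of p_theta(x|z)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        # For binary data the output is interpreted as the mean of a
        # Bernoulli distribution over each component of x.
        return torch.sigmoid(self.out(h))
</syntaxhighlight>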
 
== ELBO loss function ==
As in every [[deep learning]] problem, it is necessary to define a differentiable loss function in order to update the network weights through [[backpropagation]].
 
For variational autoencoders the idea is to jointly optimize the generative model parameters <math>\theta</math> to reduce the reconstruction error between the input and the output, and <math>\phi</math> to make <math>q_\phi(\mathbf{z\mid x})</math> as close as possible to <math>p_\theta(\mathbf{z}\mid\mathbf{x})</math>.
 
As reconstruction loss, [[mean squared error]] and [[cross entropy]] are often used.
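For example (a minimal sketch, assuming PyTorch tensors with illustrative shapes, where <code>x_hat</code> denotes the decoder output for an input batch <code>x</code>):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

x = torch.rand(16, 784)      # a batch of inputs with values in [0, 1] (illustrative)
x_hat = torch.rand(16, 784)  # the corresponding decoder outputs in [0, 1]

# Mean squared error, suited to real-valued data.
recon_mse = F.mse_loss(x_hat, x, reduction='sum')

# Binary cross-entropy, suited to data in [0, 1] modelled as Bernoulli means.
recon_bce = F.binary_cross_entropy(x_hat, x, reduction='sum')
</syntaxhighlight>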
 
As distance loss between the two distributions, the reverse Kullback–Leibler divergence <math>D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x}))</math> is a good choice, since minimizing it squeezes <math>q_\phi(\mathbf{z\mid x})</math> under <math>p_\theta(\mathbf{z}\mid\mathbf{x})</math>.<ref name=":0" /><ref>{{cite web |title=From Autoencoder to Beta-VAE |url=https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html |website=Lil'Log |language=en |date=2018-08-12}}</ref>
 
The distance loss just defined is expanded as
 
: <math>\begin{align}
D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) &= \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z\mid x})} \, d\mathbf{z}\\
&= \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})p_\theta(\mathbf{x})}{p_\theta(\mathbf{z,x})} \,d\mathbf{z}\\
&= \int q_\phi(\mathbf{z\mid x}) \left( \log (p_\theta(\mathbf{x})) + \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z,x})}\right) d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z,x})} \,d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + \int q_\phi(\mathbf{z\mid x}) \log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{x\mid z})p_\theta(\mathbf{z})} \,d\mathbf{z}\\
&= \log (p_\theta(\mathbf{x})) + E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}\left(\log \frac{q_\phi(\mathbf{z\mid x})}{p_\theta(\mathbf{z})} - \log(p_\theta(\mathbf{x\mid z}))\right)\\
&= \log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) - E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z})))
\end{align}</math>
 
At this point, it is possible to rewrite the equation as
 
: <math>\log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) = E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z}))) - D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z}))</math>
 
The goal is to maximize the [[Left hand side|LHS]] of this equation: maximizing the [[log-likelihood]] <math>\log p_\theta(\mathbf{x})</math> improves the quality of the generated data, while minimizing the Kullback–Leibler divergence reduces the distance between the true posterior and its approximation.
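This identity can be checked numerically in a toy model in which every term has a closed form. The sketch below (an illustrative check only; the one-dimensional linear-Gaussian model, the chosen values, and the helper names are assumptions, not part of the standard derivation) evaluates both sides of the equation above and confirms that they agree:

<syntaxhighlight lang="python">
import math

def kl_gauss(m1, s1, m2, s2):
    """KL divergence between the univariate Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def log_normal(x, m, s):
    """Log density of N(m, s^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * s**2) - (x - m)**2 / (2 * s**2)

# Toy model: p(z) = N(0, 1) and p(x|z) = N(z, 1),
# hence p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
x = 1.3                 # an arbitrary observation
m_q, s_q = 0.4, 0.8     # an arbitrary approximate posterior q(z|x) = N(m_q, s_q^2)

# Left-hand side: log p(x) - KL(q(z|x) || p(z|x)).
lhs = log_normal(x, 0.0, math.sqrt(2.0)) - kl_gauss(m_q, s_q, x / 2, math.sqrt(0.5))

# Right-hand side: E_q[log p(x|z)] - KL(q(z|x) || p(z)),
# where E_q[log p(x|z)] = -0.5*log(2*pi) - ((x - m_q)^2 + s_q^2)/2 in closed form.
expected_loglik = -0.5 * math.log(2 * math.pi) - ((x - m_q)**2 + s_q**2) / 2
rhs = expected_loglik - kl_gauss(m_q, s_q, 0.0, 1.0)

print(lhs, rhs)  # the two sides coincide up to floating-point error
</syntaxhighlight>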
The loss function so obtained, also named the [[evidence lower bound]] loss function, or ELBO for short, can be written as
 
: <math>L_{\theta,\phi} = -\log (p_\theta(\mathbf{x})) + D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) = -E_{\mathbf{z} \sim q_\phi(\mathbf{z\mid x})}(\log(p_\theta(\mathbf{x\mid z}))) + D_{KL}(q_\phi(\mathbf{z\mid x}) \parallel p_\theta(\mathbf{z})) </math>
 
Given the non-negativity of the Kullback–Leibler divergence, it follows that
 
: <math>-L_{\theta,\phi} = \log (p_\theta(\mathbf{x})) - D_{KL}(q_\phi(\mathbf{z\mid x})\parallel p_\theta(\mathbf{z\mid x})) \leq \log (p_\theta(\mathbf{x})) </math>
 
The optimal parameters minimize this loss function. The problem can be summarized as
 
: <math>\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{arg\,min}} \, L_{\theta,\phi} </math>
 
The main advantage of this formulation is that it allows joint optimization with respect to the parameters <math>\theta </math> and <math>\phi </math>.
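A minimal sketch of this joint optimization is given below (assuming the <code>Encoder</code> and <code>Decoder</code> modules from the earlier sketch, binary data, a standard normal prior, and the Gaussian latent model with the reparameterization trick described below; all names and shapes are illustrative):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

enc, dec = Encoder(), Decoder()   # modules from the earlier sketch
# A single optimizer updates both parameter sets theta (decoder) and phi (encoder).
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def training_step(x):
    mu, log_var = enc(x)                                       # parameters of q_phi(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample (see below)
    x_hat = dec(z)

    # Negative ELBO = reconstruction term + KL(q_phi(z|x) || p(z)),
    # with the KL term in closed form for a Gaussian posterior and N(0, I) prior.
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    loss = recon + kl

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x = torch.rand(16, 784)   # a batch of illustrative inputs in [0, 1]
print(training_step(x))
</syntaxhighlight>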
 
Before the ELBO loss function can be used in an optimization problem with backpropagation of the gradient, it must be made differentiable. This is achieved with the so-called '''reparameterization trick''', which removes the stochastic sampling from the formulation.
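The issue can be illustrated with PyTorch's distribution interface (an illustrative sketch, assuming the Gaussian latent model introduced below): a plain sample is detached from the computational graph, whereas a reparameterized sample remains a differentiable function of the distribution parameters.

<syntaxhighlight lang="python">
import torch
from torch.distributions import Normal

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)
q = Normal(mu, torch.exp(0.5 * log_var))

z_plain = q.sample()     # drawn without gradient tracking: no gradient w.r.t. mu or log_var
z_reparam = q.rsample()  # z = mu + sigma * eps with eps ~ N(0, I): differentiable

z_reparam.sum().backward()
print(mu.grad, log_var.grad)  # gradients flow back through the reparameterized sample
</syntaxhighlight>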
The main assumption about the latent space is that it can be considered to be a set of multivariate Gaussian distributions, and thus can be described as
 
: <math>\mathbf{z} \sim q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)</math>.[[File:Reparameterized Variational Autoencoder.png|thumb|The scheme of a variational autoencoder after the reparameterization trick.|300x300px]]
 
Given <math>\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \boldsymbol{I})</math> and <math>\odot</math> defined as the element-wise product ([[Hadamard product (matrices)|Hadamard product]]), the reparameterization trick modifies the above equation as