{{WikiProject banner shell|class=C|
{{WikiProject Computer science |importance=Low}}
}}
Hi, the article is really interesting and well detailed, I believe it will be a really helpful starting point for those who are willing to study this topic. I just fixed some minor things, like a missing comma, or replaced a term with a synonym. It would be nice if you could add a paragraph with some applications of this neural network :) --[[User:Lavalec|Lavalec]] ([[User talk:Lavalec|talk]]) 14:00, 18 June 2021 (UTC)
Good article, but I had to get rid of a bunch of unnecessary fluff in the Architecture section which obscured the point (diff : https://en.wikipedia.org/w/index.php?title=Variational_autoencoder&type=revision&diff=1040705234&oldid=1039806485 ). 26 August 2021
I disagree, the article really needs attention, it is very hard to understand the "Formulation" part now. I propose the following changes for the first paragraphs, but subsequent ones need revision as well:
From a formal perspective, given an input <s>dataset</s> vector <math>\mathbf{x}</math> <s>characterized by</s> '''from''' an unknown probability <s>function</s> '''distribution''' <math>P(\mathbf{x})</math> <s>and a multivariate latent encoding vector <math>\mathbf{z}</math> </s>, the objective is to model <s>the data</s> <math>P(\mathbf{x})</math> as a '''parametric''' distribution with density <math>p_\theta(\mathbf{x})</math>, where <math>\theta</math> is a vector of parameters to be learned. <s>defined as the set of the network parameters.</s>
For the parametric model we assume that each <math>\mathbf{x}</math> is associated with (arises from) a latent encoding vector <math>\mathbf{z}</math>, and we write <math>p_\theta(\mathbf{x}, \mathbf{z})</math> to denote their joint density.
<s>It is possible to formalize this distribution as</s> We can then write
: <math>p_\theta(\mathbf{x}) = \int_{\mathbf{z}}p_\theta(\mathbf{x,z}) \, d\mathbf{z} </math>
<s>where <math>p_\theta</math> is the [[Model evidence|evidence]] of the model's data with [[Marginalization (probability)|marginalization]] performed over unobserved variables and thus <math>p_\theta(\mathbf{x,z})</math> represents the [[joint distribution]] between input data and its latent representation according to the network parameters <math>\theta</math>.</s>
[[Special:Contributions/193.219.95.139|193.219.95.139]] ([[User talk:193.219.95.139|talk]]) 10:18, 2 October 2021 (UTC)
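To make the marginalization <math>p_\theta(\mathbf{x}) = \int_{\mathbf{z}}p_\theta(\mathbf{x,z}) \, d\mathbf{z}</math> concrete, here is a minimal Monte Carlo sketch in plain Python. The toy linear-Gaussian model (prior <math>z \sim N(0,1)</math>, likelihood <math>x\mid z \sim N(z,1)</math>) is chosen purely for illustration and is not from the article:

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_likelihood(x, num_samples=200_000, seed=0):
    """Monte Carlo estimate of p(x) = integral of p(x|z) p(z) dz
    for the toy model z ~ N(0, 1), x|z ~ N(z, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # sample the prior p(z)
        total += normal_pdf(x, z, 1.0)   # accumulate the likelihood p(x|z)
    return total / num_samples

# For this toy model the marginal is analytically x ~ N(0, 2),
# so the estimate can be checked against the exact density.
estimate = marginal_likelihood(1.0)
exact = normal_pdf(1.0, 0.0, 2.0)
```

In a real VAE this integral is intractable, which is exactly why the variational lower bound is introduced instead of direct Monte Carlo.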
== Observations and suggestions for improvements ==
[[User:Ettmajor|Ettmajor]] ([[User talk:Ettmajor|talk]]) 10:06, 11 July 2021 (UTC)
== Does the prior <math>p(z)</math> depend on <math>\theta</math> or not? ==
In a vanilla Gaussian VAE, the prior follows a standard Gaussian with zero mean and unit variance, i.e., there is no parametrization (<math>\theta</math> or whatsoever) concerning the prior <math>p(z)</math> of the latent representations.
On the other hand, the article as well as [Kingma&Welling2014] both parametrize the prior <math>p_\theta(z)</math> with <math>\theta</math>, just as the likelihood <math>p_\theta(x\mid z)</math>.
Clearly, the latter makes sense, since it is the very goal to learn <math>\theta</math> through the probabilistic decoder as generative model for the likelihood <math>p_\theta(x\mid z)</math>.
So is there a deeper meaning or sense in parametrizing the prior as <math>p_\theta(z)</math> as well, with the very same parameters <math>\theta</math> as the likelihood, or is it in fact a typo/mistake? <!-- Template:Unsigned IP --><small class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/46.223.162.38|46.223.162.38]] ([[User talk:46.223.162.38#top|talk]]) 22:11, 11 October 2021 (UTC)</small> <!--Autosigned by SineBot-->
The prior is not dependent on the parameters <math>\theta</math>, but rather on a different set of parameters <math>\phi</math>. <!-- Template:Unsigned IP --><small class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/134.106.109.104|134.106.109.104]] ([[User talk:134.106.109.104#top|talk]]) 12:22, 14 September 2022 (UTC)</small> <!--Autosigned by SineBot-->
:I also found this incredibly confusing. As the prior on z is usually fixed and doesn't depend on any parameter. [[User:EitanPorat|EitanPorat]] ([[User talk:EitanPorat|talk]]) 00:16, 19 March 2023 (UTC)
::I see the confusion: p(z) is a probability distribution, but sometimes the same notation is used together with a parameter set to indicate that it is actually a parameterized function. The article should be cleaned up: the encoder should be written q_phi everywhere and the decoder p_theta. To optimize the encoder you take the derivative of the free energy with respect to the encoder parameters phi; part of that gradient comes directly from the KL divergence term, but the encoder also receives reconstruction gradients that flow back through the decoder (theta). [[Special:Contributions/46.199.5.20|46.199.5.20]] ([[User talk:46.199.5.20|talk]]) 19:47, 26 December 2024 (UTC)
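The loss decomposition being discussed can be sketched numerically in plain Python. This assumes a diagonal-Gaussian encoder with a standard-normal prior and a diagonal-Gaussian decoder (the usual vanilla setup, not a quote from the article); the helper names are my own:

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims."""
    return sum(
        0.5 * (s * s + m * m - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )

def gaussian_log_likelihood(x, mean, sigma):
    """log p(x | z) for a diagonal-Gaussian decoder."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (xi - mi) ** 2 / (2 * s * s)
        for xi, mi, s in zip(x, mean, sigma)
    )

def negative_elbo(x, enc_mu, enc_sigma, dec_mean, dec_sigma):
    """Free energy = -reconstruction + KL.  The KL term depends only on
    the encoder outputs; the reconstruction term reaches the encoder
    indirectly, through the reparameterized sample z fed to the decoder."""
    recon = gaussian_log_likelihood(x, dec_mean, dec_sigma)
    kl = kl_to_standard_normal(enc_mu, enc_sigma)
    return -recon + kl

# When the encoder already matches the prior, the KL term vanishes.
loss = negative_elbo([0.0], [0.0], [1.0], [0.0], [1.0])
```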
== The image shows just a normal autoencoder, not a variational autoencoder ==
There is an image with a caption saying it is a variational autoencoder, but it is showing just a plain autoencoder.
In a different section, there is something described as a "trick", which seems to be the central point that distinguishes autoencoders from variational autoencoders.
I'm not sure whether that image should just be removed, or whether it makes sense in the section anyway. [[User:Volker Siegel|Volker Siegel]] ([[User talk:Volker Siegel|talk]]) 14:18, 24 January 2022 (UTC)
:Just to make this point clear: The reparameterization trick is for the gradients! The trick separates the source of randomness to another node in the DAG that does not have any parameters, so that we can propagate gradients through the rest of the DAG that is now a deterministic function. [[Special:Contributions/82.102.110.228|82.102.110.228]] ([[User talk:82.102.110.228|talk]]) 18:57, 27 December 2024 (UTC)
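The separation into a parameter-free noise node plus a deterministic function can be sketched in a few lines of plain Python (scalar case, illustrative only):

```python
import random

def sample_z_reparameterized(mu, sigma, rng):
    """Reparameterization trick: draw eps from a noise node with no
    parameters, then apply a deterministic function of (mu, sigma, eps)."""
    eps = rng.gauss(0.0, 1.0)   # all randomness lives in this node
    z = mu + sigma * eps        # deterministic given eps
    return z, eps

# With eps held fixed, z is an ordinary differentiable function of
# (mu, sigma): dz/dmu = 1 and dz/dsigma = eps, so gradients can be
# propagated through the deterministic part of the graph.
rng = random.Random(42)
z, eps = sample_z_reparameterized(mu=2.0, sigma=0.5, rng=rng)
```

Sampling z directly from N(mu, sigma^2) would bury mu and sigma inside a stochastic node, which is exactly what blocks naive backpropagation.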
== This is a highly technical topic ==
In the past users have removed much of the technicality involved in the topic. Wikipedia does not have a limit to the depth of technicality, however Simple Wikipedia does. If you find yourself wanting to remove technical depth from the article, please edit the Simple Wikipedia article. [[Special:Contributions/2A01:C23:7C81:1A00:2B9B:EB91:3CC5:3222|2A01:C23:7C81:1A00:2B9B:EB91:3CC5:3222]] ([[User talk:2A01:C23:7C81:1A00:2B9B:EB91:3CC5:3222|talk]]) 10:31, 19 November 2022 (UTC)
== Overview section is poorly written ==
The architecture section is filled with unclear phrases and undefined terms. For example, "noise distribution", "q-distributions or variational posteriors", "p-distributions", "amortized approach", "which is usually intractable" (what is intractable?), "free energy expression". None of these are defined. It is unclear if this section of the article is useful to anyone who is not already familiar with how variational autoencoders work. [[User:Joshuame13|Joshuame13]] ([[User talk:Joshuame13|talk]]) 15:14, 31 January 2023 (UTC)
:I've fixed most of those. The free energy really needs its own section. It is a lower bound that is obtained by using Jensen's inequality on the log likelihood. However, I don't think that Jensen's inequality is within the scope of this article. [[Special:Contributions/46.199.5.20|46.199.5.20]] ([[User talk:46.199.5.20|talk]]) 19:50, 26 December 2024 (UTC)
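For reference, the bound mentioned above follows in one line from Jensen's inequality (a standard derivation, sketched here rather than quoted from the article):
: <math>\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z\mid x)}\!\left[\frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] \ge \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right],</math>
since log is concave. The right-hand side is the ELBO, and its negative is the free energy.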
== The ELBO section needs more derivation ==
''"The form given is not very convenient for maximization, but the following, equivalent form, is:"''
There should be more steps to explain how the equivalent form is obtained from the "given" one. Also, the dot placeholder notation is inconsistent, changing from <math>p_\theta(\cdot|x)</math> to <math>p_\theta(\cdot)</math>. [[User:PromethiumL|PromethiumL]] ([[User talk:PromethiumL|talk]]) 18:08, 12 February 2023 (UTC)
:I agree p_theta(z) doesn't make sense. [[User:EitanPorat|EitanPorat]] ([[User talk:EitanPorat|talk]]) 00:17, 19 March 2023 (UTC)
::Agreed. It should be p_phi(z) or even better q_phi(z). [[Special:Contributions/46.199.5.20|46.199.5.20]] ([[User talk:46.199.5.20|talk]]) 20:22, 26 December 2024 (UTC)
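The missing derivation steps can be sketched as follows (a standard manipulation, written with the q_phi notation suggested above). Starting from the identity
: <math>\log p_\theta(x) = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] + D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big),</math>
the first term is the ELBO, and since the KL term is nonnegative, the ELBO lower-bounds <math>\log p_\theta(x)</math>. Expanding <math>p_\theta(x,z)=p_\theta(x\mid z)\,p(z)</math> then yields the form that is convenient for maximization:
: <math>\mathrm{ELBO} = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big).</math>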
== Rating this article C-class ==
This article has great potential. Excellent technical content. But I just rated it "C" because it seems to have gained both content and noise over the past six months. I've tried for a couple of hours to improve the clarity of the central idea of a VAE, but I'm not satisfied with my efforts. In particular, it is still unclear to me whether both the encoder and decoder are technically random, whether any randomness should be added in the decoder, or what (beyond Z) is modeled with a multimodal Gaussian in the basic VAE. I see no reason why this article should not be accessible both to casual readers and to the technically proficient, but we are far from there yet.
In particular, the introductory figure shows x being mapped to a Gaussian figure and back to x'. It would be good to explicitly state how the encoder and decoder in this figure relate to the various distributions used throughout the article, but I'm not confident about how to do so. [[User:Yoderj|Yoderj]] ([[User talk:Yoderj|talk]]) 19:25, 15 March 2024 (UTC)
I will try to give simple answers to your questions. "Encoder" is a bad name and confuses people. In actuality, the encoder is a Gaussian distribution: it has a mean and a variance, each given by a neural network. These networks are initially random and are trained using gradients from the loss function. The decoder is also a Gaussian distribution, whose mean and variance are given by another neural network. Is the decoder technically random? It depends. During training you want to estimate an expectation, and to do that you draw samples, so the result is random: during training the decoder is random. On the other hand, if you are just doing inference, the decoder can be deterministic: since the output is Gaussian, you can take a single sample from the encoder and output only the decoder mean, in the spirit of maximum a posteriori. You may ask: if it is a Gaussian decoder, don't you also have to add the variance? That is a fair question, but in many applications you can ignore it and output only the mean.
I hope that this clears things up. We have four variables: the mean and variance of the encoder, and the mean and variance of the decoder. These can be multidimensional in the multivariate Gaussian case, but they are still four variables. Here are some equations to help you understand:
z = mu(x) + sigma(x)*epsilon # reparameterization trick, epsilon ~ N(0, 1)
x' = MU(z) + SIGMA(z)*epsilon' # decoder sampling, epsilon' a fresh draw from N(0, 1)
And here is the legend:
x: input
z: sample from the latent, aka sample from the encoder, aka output of mu(x) plus output of sigma(x) with randomness
mu, sigma: encoder neural networks
MU, SIGMA: decoder neural networks
x': output
At the end of the day, people have to juggle the interaction of two probability distributions. I doubt that it can be simplified enough for the general populace at this time.
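The four-variable picture above can be made runnable in plain Python. The toy one-liner "networks" standing in for mu, sigma, MU, SIGMA are hypothetical stand-ins chosen only so the sketch executes; any real implementation would use trained neural networks:

```python
import random

rng = random.Random(0)

# Hypothetical stand-ins for the four networks in the legend above.
def mu(x):    return 0.9 * x              # encoder mean network
def sigma(x): return 0.1 + 0.01 * x * x   # encoder std network (kept positive)
def MU(z):    return 1.1 * z              # decoder mean network
def SIGMA(z): return 0.2                  # decoder std network

def forward(x, sample_decoder=True):
    """One pass x -> z -> x' using the reparameterization trick."""
    z = mu(x) + sigma(x) * rng.gauss(0.0, 1.0)   # sample from the encoder
    if sample_decoder:
        # training-style pass: the decoder output is also random
        return MU(z) + SIGMA(z) * rng.gauss(0.0, 1.0)
    # inference shortcut discussed above: just output the decoder mean
    return MU(z)

x_prime = forward(1.0)
```

Averaged over many passes with the decoder noise switched off, x' concentrates around MU(mu(x)), which is the deterministic "just take the mean" behavior described above.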
[[Special:Contributions/46.199.5.20|46.199.5.20]] ([[User talk:46.199.5.20|talk]]) 19:34, 26 December 2024 (UTC)