Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. The approach used techniques from [[non-equilibrium thermodynamics]], especially [[diffusion]].<ref>{{Cite journal |last1=Sohl-Dickstein |first1=Jascha |last2=Weiss |first2=Eric |last3=Maheswaranathan |first3=Niru |last4=Ganguli |first4=Surya |date=2015-06-01 |title=Deep Unsupervised Learning using Nonequilibrium Thermodynamics |url=http://proceedings.mlr.press/v37/sohl-dickstein15.pdf |journal=Proceedings of the 32nd International Conference on Machine Learning |language=en |publisher=PMLR |volume=37 |pages=2256–2265|arxiv=1503.03585 }}</ref>
 
Consider, for example, how one might model the distribution of all naturally-occurring photos. Each image is a point in the space of all images, and the distribution of naturally-occurring photos is a "cloud" in this space. By repeatedly adding noise to the images, this cloud diffuses out to the rest of the image space, until it becomes all but indistinguishable from a [[Normal distribution|Gaussian distribution]] <math>\mathcal{N}(0, I)</math>. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.
 
The equilibrium distribution is the Gaussian distribution <math>\mathcal{N}(0, I)</math>, with pdf <math>\rho(x) \propto e^{-\frac 12 \|x\|^2}</math>. This is just the [[Maxwell–Boltzmann distribution]] of particles in a potential well <math>V(x) = \frac 12 \|x\|^2</math> at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a [[Brownian motion|Brownian walker]]) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.
 
=== Denoising Diffusion Probabilistic Model (DDPM) ===
The section uses the following notation:
* <math>\beta_1, ..., \beta_T \in (0, 1)</math> are fixed constants (the "noise schedule")
* <math>\alpha_t := 1 - \beta_t</math>
* <math>\bar\alpha_t := \alpha_1 \cdots \alpha_t</math>
* <math>\sigma_t := \sqrt{1 - \bar\alpha_t}</math>
* <math>\tilde \sigma_t := \frac{\sigma_{t-1}}{\sigma_{t}}\sqrt{\beta_t}</math>
* <math>\tilde\mu_t(x_t, x_0) :=\frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})x_t +\sqrt{\bar\alpha_{t-1}}(1-\alpha_{t})x_0}{\sigma_{t}^2}</math>
* <math>\mathcal{N}(\mu, \Sigma)</math> is the normal distribution with mean <math>\mu</math> and variance <math>\Sigma</math>, and <math>\mathcal{N}(x | \mu, \Sigma)</math> is the probability density at <math>x</math>.
* A vertical bar denotes [[Conditioning (probability)|conditioning]].
 
A '''forward diffusion process''' starts at some starting point <math>x_0 \sim q</math>, where <math>q</math> is the probability distribution to be learned, then repeatedly adds noise to it by<math display="block">x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} z_t</math>where <math>z_1, ..., z_T</math> are IID samples from <math>\mathcal{N}(0, I)</math>. This is designed so that for any starting distribution of <math>x_0</math>, the distribution of <math>x_t|x_0</math> converges to <math>\mathcal{N}(0, I)</math> as <math>t \to \infty</math>.
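In code, the forward process is only a few lines. A minimal NumPy sketch; the linear noise schedule and step count here are illustrative choices, not prescribed by the theory:
<syntaxhighlight lang="python">
import numpy as np

def forward_diffusion(x0, betas, rng=None):
    """Run the forward noising chain, returning every intermediate x_t."""
    rng = rng or np.random.default_rng()
    xs = [x0]
    x = x0
    for beta in betas:
        z = rng.standard_normal(x.shape)                 # z_t ~ N(0, I)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * z  # x_t from x_{t-1}
        xs.append(x)
    return xs

# Example: after ~1000 steps of a small schedule, x_T is close to N(0, I).
betas = np.linspace(1e-4, 0.02, 1000)                    # illustrative schedule
xs = forward_diffusion(np.ones(8), betas)
</syntaxhighlight>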
 
The entire diffusion process then satisfies<math display="block">q(x_{0:T}) = q(x_0)q(x_1|x_0) \cdots q(x_T|x_{T-1}) = q(x_0) \mathcal{N}(x_1 | \sqrt{\alpha_1} x_0, \beta_1 I) \cdots \mathcal{N}(x_T | \sqrt{\alpha_T} x_{T-1}, \beta_T I)</math>or<math display="block">\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \| x_t - \sqrt{1-\beta_t}x_{t-1}\|^2 + C</math>where <math>C</math> is a normalization constant, often omitted. In particular, we note that <math>x_{1:T}|x_0</math> is a [[Gaussian process]], which affords us considerable freedom in [[Reparameterization trick|reparameterization]]. For example, by standard manipulations of Gaussian processes, <math display="block">x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math><math display="block">x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)</math>In particular, notice that for large <math>t</math>, the variable <math>x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math> converges to <math>\mathcal{N}(0, I)</math>. That is, after a long enough diffusion process, we end up with some <math>x_T</math> that is very close to <math>\mathcal{N}(0, I)</math>, with all traces of the original <math>x_0 \sim q</math> gone.
 
For example, since<math display="block">x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math>we can sample <math>x_{t}|x_0</math> directly "in one step", instead of going through all the intermediate steps <math>x_1, x_2, ..., x_{t-1}</math>.
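The shortcut only needs the cumulative products <math>\bar\alpha_t</math>. A sketch continuing the illustrative schedule above:
<syntaxhighlight lang="python">
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)             # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)              # alpha_bars[t-1] = \bar\alpha_t

def sample_xt_given_x0(x0, t, rng=None):
    """Sample x_t | x_0 in one step, skipping x_1, ..., x_{t-1}."""
    rng = rng or np.random.default_rng()
    sigma_t = np.sqrt(1.0 - alpha_bars[t - 1])    # sigma_t = sqrt(1 - \bar\alpha_t)
    return np.sqrt(alpha_bars[t - 1]) * x0 + sigma_t * rng.standard_normal(x0.shape)
</syntaxhighlight>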
 
==== Backward diffusion ====
The key idea of DDPM is to use a neural network parametrized by <math>\theta</math>. The network takes in two arguments <math>x_t, t</math>, and outputs a vector <math>\mu_\theta(x_t, t)</math> and a matrix <math>\Sigma_\theta(x_t, t)</math>, such that each step in the forward diffusion process can be approximately undone by <math>x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>. This then gives us a backward diffusion process <math>p_\theta</math> defined by<math display="block">p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math><math display="block">p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>The goal now is to learn the parameters such that <math>p_\theta(x_0)</math> is as close to <math>q(x_0)</math> as possible. To do that, we use [[maximum likelihood estimation]] with variational inference.
 
==== Variational inference ====
The [[Evidence lower bound|ELBO inequality]] states that <math>\ln p_\theta(x_0) \geq E_{x_{1:T}\sim q(\cdot | x_0)}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>, and taking one more expectation, we get<math display="block">E_{x_0 \sim q}[\ln p_\theta(x_0)] \geq E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.
 
Define the loss function<math display="block">L(\theta) := -E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]</math>and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to<ref name=":7">{{Cite web |last=Weng |first=Lilian |date=2021-07-11 |title=What are Diffusion Models? |url=https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ |access-date=2023-09-24 |website=lilianweng.github.io |language=en}}</ref><math display="block">L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)] + E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))] + C</math>where <math>C</math> does not depend on the parameter, and thus can be ignored. Since <math>p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math> also does not depend on the parameter, the term <math>E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))]</math> can also be ignored. This leaves just <math>L(\theta ) = \sum_{t=1}^T L_t</math> with <math>L_t = E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)]</math> to be minimized.
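To see where the decomposition comes from, expand both joint densities into their Markov factorizations:<math display="block">\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0) = \ln p_\theta(x_T) + \sum_{t=1}^T \ln p_\theta(x_{t-1}|x_t) - \ln q(x_T|x_0) - \big(\ln q(x_{1:T}|x_0) - \ln q(x_T|x_0)\big)</math>Taking <math>-E_{x_{0:T}\sim q}[\cdot]</math> of both sides, the first and third terms combine into <math>E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))]</math>, the sum yields the <math>L_t</math> terms, and the final bracket does not depend on <math>\theta</math>, so it is absorbed into <math>C</math>.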
 
==== Noise prediction network ====
Since <math>x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)</math>, this suggests that we should use <math>\mu_\theta(x_t, t) = \tilde \mu_t(x_t, x_0)</math>; however, the network does not have access to <math>x_0</math>, and so it has to estimate it instead. Since <math>x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math>, we may write <math>x_t = \sqrt{\bar\alpha_t} x_{0} + \sigma_t z</math>, where <math>z</math> is some unknown Gaussian noise. Thus, estimating <math>x_0</math> is equivalent to estimating <math>z</math>.
 
Therefore, let the network output a noise vector <math>\epsilon_\theta(x_t, t)</math>, and let it predict<math display="block">\mu_\theta(x_t, t) =\tilde\mu_t\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t, t) \beta_t/\sigma_t}{\sqrt{\alpha_t}}</math>It remains to design <math>\Sigma_\theta(x_t, t)</math>. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value <math>\Sigma_\theta(x_t, t) = \zeta_t^2 I</math>, where either <math>\zeta_t^2 = \beta_t \text{ or } \tilde\sigma_t^2</math> yielded similar performance.
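The closed form on the right follows by direct substitution into the definition of <math>\tilde\mu_t</math>: using <math>\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)/\sqrt{\bar\alpha_t} = \beta_t/\sqrt{\alpha_t}</math> and <math>\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = 1 - \bar\alpha_t = \sigma_t^2</math>,<math display="block">\tilde\mu_t\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta}{\sqrt{\bar\alpha_t}}\right) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})x_t + \frac{\beta_t}{\sqrt{\alpha_t}}(x_t - \sigma_t \epsilon_\theta)}{\sigma_t^2} = \frac{x_t - \epsilon_\theta \beta_t/\sigma_t}{\sqrt{\alpha_t}}</math>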
 
With this, the loss simplifies to <math display="block">L_t = \frac{\beta_t^2}{2\alpha_t\sigma_{t}^2\zeta_t^2} E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right] + C</math>which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>resulted in better models.
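A single evaluation of the simplified loss can be sketched as follows; the stand-in network <code>eps_model</code> and the schedule are illustrative assumptions (in practice the network is typically a U-Net):
<syntaxhighlight lang="python">
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, x0):
    """One evaluation of L_simple on a batch x0 of shape (batch, dim)."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,))             # uniform random timestep
    ab = alpha_bars[t - 1].unsqueeze(-1)          # \bar\alpha_t per sample
    z = torch.randn_like(x0)                      # z ~ N(0, I)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * z     # x_t = sqrt(ab) x0 + sigma_t z
    return ((eps_model(xt, t) - z) ** 2).mean()

# Usage with a stand-in network (hypothetical; any (x_t, t) -> noise net works):
loss = ddpm_loss(lambda x, t: torch.zeros_like(x), torch.randn(16, 8))
</syntaxhighlight>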
 
=== Backward diffusion process ===
Having trained the noise prediction network <math>\epsilon_\theta</math>, it can be used to generate samples from (approximately) the original distribution <math>q</math>: start with <math>x_T \sim \mathcal{N}(0, I)</math> and <math>t \leftarrow T</math>, then repeat until <math>t = 0</math>:
# Compute the noise estimate <math>\epsilon \leftarrow \epsilon_\theta(x_t, t)</math>
# Compute the original data estimate <math>\tilde x_0 \leftarrow (x_t - \sigma_t \epsilon) / \sqrt{\bar \alpha_t} </math>
# Sample the previous data <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>
# Change time <math>t \leftarrow t-1</math>
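Collecting the steps, a NumPy sketch of the full sampling loop; the schedule and the stand-in <code>eps_model</code> are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)                      # illustrative schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1])) # \bar\alpha_{t-1}
sigmas = np.sqrt(1.0 - alpha_bars)                         # sigma_t

def ddpm_sample(eps_model, dim, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(dim)                           # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        i = t - 1                                          # 0-based index for step t
        eps = eps_model(x, t)                              # 1. noise estimate
        x0_hat = (x - sigmas[i] * eps) / np.sqrt(alpha_bars[i])   # 2. data estimate
        # 3. posterior mean mu_tilde_t(x_t, x0_hat) and std sigma_tilde_t
        mean = (np.sqrt(alphas[i]) * (1 - alpha_bars_prev[i]) * x
                + np.sqrt(alpha_bars_prev[i]) * betas[i] * x0_hat) / sigmas[i] ** 2
        std = np.sqrt(1 - alpha_bars_prev[i]) / sigmas[i] * np.sqrt(betas[i])
        x = mean + std * rng.standard_normal(dim)          # 4. sample x_{t-1}
    return x                                               # std is 0 at t = 1

# Usage with a stand-in network (any trained (x, t) -> noise estimate would do):
sample = ddpm_sample(lambda x, t: np.zeros_like(x), dim=8)
</syntaxhighlight>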
 
 
==== Annealing the score function ====
Suppose we need to model the distribution of images, and we want to start the generation from <math>x_0 \sim \mathcal{N}(0, I)</math>, a white-noise image. Now, most white-noise images do not look like real images, so <math>q(x_0) \approx 0</math> for large swaths of <math>x_0 \sim \mathcal{N}(0, I)</math>. This presents a problem for learning the score function, because if there are no samples around a certain point, then we cannot learn the score function at that point. If we do not know the score function <math>\nabla_{x_t}\ln q(x_t)</math> at that point, then we cannot impose the time-evolution equation on a particle:<math display="block">dx_{t}= \nabla_{x_t}\ln q(x_t) d t+d W_t</math>To deal with this problem, we perform [[Simulated annealing|annealing]]: if <math>q</math> is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.
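A sketch of this procedure in the spirit of annealed Langevin dynamics; the noise-conditioned score network <code>score_model(x, sigma)</code>, the step-size rule, and the noise levels are all illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

def annealed_langevin(score_model, dim, sigma_levels, n_steps=100,
                      eps0=2e-5, rng=None):
    """Langevin dynamics run at progressively smaller noise levels."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(dim)                   # start from white noise
    for sigma in sigma_levels:                     # anneal from large to small sigma
        step = eps0 * (sigma / sigma_levels[-1]) ** 2   # common step-size heuristic
        for _ in range(n_steps):
            z = rng.standard_normal(dim)
            # gradient step along the smoothed score, plus pure randomness
            x = x + 0.5 * step * score_model(x, sigma) + np.sqrt(step) * z
    return x

# Noise levels might form a decreasing geometric sequence, e.g.:
sigma_levels = np.geomspace(10.0, 0.01, num=10)
</syntaxhighlight>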
 
=== Continuous diffusion processes ===
In the limit of infinitesimally small steps, the forward diffusion process becomes a continuous diffusion described by the stochastic differential equation<math display="block">dx_t = -\frac 12 \beta(t) x_t \, dt + \sqrt{\beta(t)} \, dW_t</math>where <math>W_t</math> is a [[Wiener process]] (multidimensional Brownian motion) and <math>\beta(t)</math> is a continuous noise schedule.
Now, the equation is exactly a special case of the [[Brownian dynamics|overdamped Langevin equation]]<math display="block">dx_t = -\frac{D}{k_BT} (\nabla_x U)dt + \sqrt{2D}dW_t</math>where <math>D</math> is the diffusion tensor, <math>T</math> is the temperature, and <math>U</math> is the potential energy field. If we substitute in <math>D= \frac 12 \beta(t)I, k_BT = 1, U = \frac 12 \|x\|^2</math>, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.
 
Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to <math>q</math> at time <math>t=0</math>; after a long time, the cloud of particles would settle into the stable distribution <math>\mathcal{N}(0, I)</math>. Let <math>\rho_t</math> be the density of the cloud of particles at time <math>t</math>; then we have<math display="block">\rho_0 = q; \quad \rho_T \approx \mathcal{N}(0, I)</math>and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.
 
By the [[Fokker–Planck equation]], the density of the cloud evolves according to<math display="block">\partial_t \ln \rho_t = \frac 12 \beta(t) \left( n + (x + \nabla_x \ln \rho_t) \cdot \nabla_x \ln \rho_t + \Delta_x \ln \rho_t \right)</math>where <math>n</math> is the dimension of the space and <math>\Delta</math> is the [[Laplace operator]]. Since the forward SDE is linear in <math>x_t</math>, the conditional distribution <math>x_t|x_0</math> is Gaussian and can be solved in closed form (the process is an [[Ornstein–Uhlenbeck process]]), and so
<math display="block">x_{t}|x_0 \sim \mathcal{N}\left(e^{-\frac 12\int_0^t \beta(s)ds} x_{0}, \left(1- e^{-\int_0^t \beta(s)ds}\right) I \right)</math>
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling <math>x_0 \sim q, z \sim \mathcal{N}(0, I)</math>, then setting <math>x_t = e^{-\frac 12\int_0^t \beta(s)ds} x_{0} + \sqrt{1- e^{-\int_0^t \beta(s)ds}}\; z</math>. That is, we can quickly sample <math>x_t \sim \rho_t</math> for any <math>t \geq 0</math>.
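For a concrete schedule the integral has a closed form. A sketch assuming a linear <math>\beta(t)</math> (an illustrative choice, as are the endpoint values):
<syntaxhighlight lang="python">
import numpy as np

def int_beta(t, beta_min=0.1, beta_max=20.0, T=1.0):
    """Integral from 0 to t of an assumed linear schedule beta(s)."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2 / T

def sample_rho_t(x0, t, rng=None):
    """Sample x_t ~ rho_t given x_0, in one step."""
    rng = rng or np.random.default_rng()
    B = int_beta(t)
    return (np.exp(-0.5 * B) * x0
            + np.sqrt(1.0 - np.exp(-B)) * rng.standard_normal(x0.shape))
</syntaxhighlight>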
 
Now, define a certain probability distribution <math>\gamma</math> over <math>[0, \infty)</math>; then the score-matching loss function is defined as the expected Fisher divergence, which after [[integration by parts]] equals, up to a constant independent of <math>f_\theta</math>:
<math display="block">L(\theta) = E_{t\sim \gamma, x_t \sim \rho_t}[\|f_\theta(x_t, t)\|^2 + 2\nabla\cdot f_\theta(x_t, t)]</math>
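In high dimension, the divergence term is the expensive part; it is commonly estimated with Hutchinson's trace estimator. A hedged PyTorch sketch, where the score network <code>f</code> is a stand-in:
<syntaxhighlight lang="python">
import torch

def implicit_score_matching_loss(f, x, t):
    """Monte-Carlo estimate of E[ ||f||^2 + 2 div f ] at samples x with times t."""
    x = x.detach().requires_grad_(True)
    fx = f(x, t)
    # Hutchinson estimator: div f = E_v[ v . (J_f v) ] with v ~ N(0, I)
    v = torch.randn_like(x)
    (grad,) = torch.autograd.grad((fx * v).sum(), x, create_graph=True)
    div_est = (grad * v).sum(dim=-1)               # one-sample estimate of div f
    return ((fx ** 2).sum(dim=-1) + 2.0 * div_est).mean()

# Usage with the exact score of N(0, I), for which f(x, t) = -x:
loss = implicit_score_matching_loss(lambda x, t: -x, torch.randn(64, 8), torch.rand(64))
</syntaxhighlight>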
After training, <math>f_\theta(x_t, t) \approx \nabla \ln\rho_t</math>, so we can perform the backwards diffusion process by first sampling <math>x_T \sim \mathcal{N}(0, I)</math>, then integrating the SDE from <math>t=T</math> to <math>t=0</math>:
<math display="block">x_{t-dt}=x_t + \frac{1}{2} \beta(t) x_{t} d t + \beta(t) f_\theta(x_t, t) d t+\sqrt{\beta(t)} d W_t</math>
This may be done by any SDE integration method, such as [[Euler–Maruyama method]].
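A minimal Euler–Maruyama sketch of this backward integration, assuming some trained score approximation <code>f</code> and, for simplicity, a constant <math>\beta</math>:
<syntaxhighlight lang="python">
import numpy as np

def reverse_sde_euler_maruyama(f, dim, T=1.0, n_steps=1000,
                               beta=lambda t: 1.0, rng=None):
    """Integrate the backward SDE from t = T down to t = 0."""
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    x = rng.standard_normal(dim)                   # x_T ~ N(0, I)
    t = T
    for _ in range(n_steps):
        b = beta(t)
        dW = np.sqrt(dt) * rng.standard_normal(dim)
        # one Euler-Maruyama step of the backward equation
        x = x + 0.5 * b * x * dt + b * f(x, t) * dt + np.sqrt(b) * dW
        t -= dt
    return x
</syntaxhighlight>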
=== Their equivalence ===
DDPM (noise prediction) and score-based generative models are equivalent. To see this, note that by [[Tweedie's formula]], the score function can be expressed through a conditional expectation:
<math display="block">\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_{t}^2}(-x_t + \sqrt{\bar\alpha_t} E_q[x_0|x_t])</math>
As described previously, the DDPM loss function is <math>\sum_t L_{simple, t}</math> with
<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>
where <math>x_t = \sqrt{\bar\alpha_t} x_{0} + \sigma_t z</math>. By a change of variables,
<math display="block">L_{simple, t} = E_{x_t\sim q, x_0 \sim q(\cdot | x_t)}\left[ \left\| \epsilon_\theta(x_t, t) - \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sigma_t} \right\|^2\right]</math>
and the term inside the expectation is a [[least squares]] regression, so if the network reaches the global minimum of the loss, then
<math display="block">\epsilon_\theta(x_t, t) = \frac{x_t - \sqrt{\bar\alpha_t} E_q[x_0|x_t]}{\sigma_t} = -\sigma_t \nabla_{x_t}\ln q(x_t)</math>
Thus, a trained noise prediction network is, up to the factor <math>-\sigma_t</math>, an estimator of the score function.
 
Conversely, the continuous limit <math>x_{t-1} = x_{t-dt}, \beta_t = \beta(t) dt, z_t\sqrt{dt} = dW_t</math> of the backward equation
<math display="block">x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}}- \frac{ \beta_t}{\sigma_{t}\sqrt{\alpha_t }} \epsilon_\theta(x_t, t) + \sqrt{\beta_t} z_t; \quad z_t \sim \mathcal{N}(0, I)</math>
gives us precisely the same equation as score-based diffusion:
<math display="block">x_{t-dt} = x_t(1+\beta(t)dt / 2) + \beta(t) \nabla_{x_t}\ln q(x_t) dt + \sqrt{\beta(t)}dW_t</math>Thus, a denoising network can be used for score-based diffusion.
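In code, converting between the two parameterizations is one line in each direction; a sketch assuming the discrete schedule from earlier:
<syntaxhighlight lang="python">
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)              # illustrative schedule
sigmas = np.sqrt(1.0 - np.cumprod(1.0 - betas))    # sigma_t

def score_from_eps(eps_model, x, t):
    """Score estimate from a noise predictor: grad ln q(x_t) = -eps / sigma_t."""
    return -eps_model(x, t) / sigmas[t - 1]

def eps_from_score(score_model, x, t):
    """Noise estimate from a score network: eps = -sigma_t * grad ln q(x_t)."""
    return -sigmas[t - 1] * score_model(x, t)
</syntaxhighlight>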
 
=== Denoising Diffusion Implicit Model (DDIM) ===
The original DDPM method for generating images is slow, since the forward diffusion process usually takes <math>T \sim 1000</math> steps to make the distribution of <math>x_T</math> appear close to Gaussian, which means the backward diffusion process must also take <math>T \sim 1000</math> steps. Unlike the forward diffusion process, which can skip steps because <math>x_t | x_0</math> is Gaussian for all <math>t \geq 1</math>, the backward diffusion process does not allow skipping steps. For example, sampling <math>x_{t-2}|x_{t-1} \sim \mathcal{N}(\mu_\theta(x_{t-1}, t-1), \Sigma_\theta(x_{t-1}, t-1))</math> requires the model to first sample <math>x_{t-1}</math>. Attempting to directly sample <math>x_{t-2}|x_t</math> would require us to marginalize out <math>x_{t-1}</math>, which is generally intractable.
 
DDIM<ref>{{Cite arXiv |last1=Song |first1=Jiaming |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano |date=3 Oct 2023 |title=Denoising Diffusion Implicit Models |class=cs.LG |eprint=2010.02502}}</ref> is a method to take any model trained on the DDPM loss and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generalize the Markovian forward process of DDPM to a non-Markovian one, DDIM corresponds to the case where the reverse process has zero variance; in other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.
In detail, the DDIM sampling method is as follows. Start with the forward diffusion process <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t \epsilon</math>. Then, during the backward denoising process, given <math>x_t, \epsilon_\theta(x_t, t)</math>, the original data is estimated as <math display="block">x_0' = \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{ \sqrt{\bar\alpha_t}}</math>then the backward diffusion process can jump to any step <math>0 \leq s < t</math>, and the next denoised sample is <math display="block">x_{s} = \sqrt{\bar\alpha_{s}} x_0'
+ \sqrt{\sigma_{s}^2 - (\sigma'_s)^2} \epsilon_\theta(x_t, t)
+ \sigma_s' \epsilon</math>where <math>\sigma_s'</math> is an arbitrary real number within the range <math>[0, \sigma_s]</math>, and <math>\epsilon \sim \mathcal{N}(0, I)</math> is a newly sampled Gaussian noise.<ref name=":7" /> If all <math>\sigma_s' = 0</math>, then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already, at a high level, very similar to ones generated with 1000 steps.
 
The original paper recommended defining a single "eta value" <math>\eta \in [0, 1]</math>, such that <math>\sigma_s' = \eta \tilde\sigma_s</math>. When <math>\eta = 1</math>, this is the original DDPM. When <math>\eta = 0</math>, this is the fully deterministic DDIM. For intermediate values, the process interpolates between them.
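A NumPy sketch of the sampler over a strided subset of timesteps; the schedule, the stride, the stand-in <code>eps_model</code>, and the strided analogue of <math>\tilde\sigma</math> are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)              # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)
sigmas = np.sqrt(1.0 - alpha_bars)                 # sigma_t

def ddim_sample(eps_model, dim, n_steps=50, eta=0.0, rng=None):
    rng = rng or np.random.default_rng()
    ts = np.linspace(len(betas), 1, n_steps).astype(int)   # strided timesteps
    x = rng.standard_normal(dim)                           # x_T ~ N(0, I)
    for t, s in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        x0_hat = (x - sigmas[t-1] * eps) / np.sqrt(alpha_bars[t-1])
        # strided analogue of sigma-tilde; eta=0 is deterministic DDIM, eta=1 DDPM-like
        sp = eta * np.sqrt((1 - alpha_bars[s-1]) / (1 - alpha_bars[t-1])
                           * (1 - alpha_bars[t-1] / alpha_bars[s-1]))
        x = (np.sqrt(alpha_bars[s-1]) * x0_hat
             + np.sqrt(sigmas[s-1] ** 2 - sp ** 2) * eps
             + sp * rng.standard_normal(dim))
    # final jump to the data estimate at the smallest timestep
    eps = eps_model(x, ts[-1])
    return (x - sigmas[ts[-1]-1] * eps) / np.sqrt(alpha_bars[ts[-1]-1])
</syntaxhighlight>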
 
=== Architectural improvements ===
Nichol and Dhariwal (2021)<ref>{{Cite journal |last1=Nichol |first1=Alexander Quinn |last2=Dhariwal |first2=Prafulla |date=2021-07-01 |title=Improved Denoising Diffusion Probabilistic Models |url=https://proceedings.mlr.press/v139/nichol21a.html |journal=Proceedings of the 38th International Conference on Machine Learning |language=en |publisher=PMLR |pages=8162–8171}}</ref> proposed various architectural improvements. For example, they proposed log-space interpolation during backward sampling. Instead of sampling from <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>, they recommended sampling from <math>\mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), (\sigma_t^v \tilde\sigma_t^{1-v})^2 I)</math> for a learned parameter <math>v</math>.
 
In the ''v-prediction'' formalism, the noising formula <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon_t</math> is reparameterised by an angle <math>\phi_t</math> such that <math>\cos \phi_t = \sqrt{\bar\alpha_t}</math> and a "velocity" defined by <math>v := \cos\phi_t \,\epsilon_t - \sin\phi_t \,x_0</math>. The network is trained to predict the velocity <math>\hat v_\theta</math>, and denoising proceeds by <math>x_{\phi_t - \delta} = \cos(\delta)\, x_{\phi_t} - \sin(\delta)\, \hat{v}_\theta(x_{\phi_t})</math>.<ref>{{Cite conference|conference=The Tenth International Conference on Learning Representations (ICLR 2022)|last1=Salimans|first1=Tim|last2=Ho|first2=Jonathan|date=2021-10-06|title=Progressive Distillation for Fast Sampling of Diffusion Models|url=https://openreview.net/forum?id=TIdIXIpzhoI|language=en}}</ref> This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. <math>\phi_t = 90^\circ</math>) and then reverse it, whereas the standard parameterization never reaches total noise, since <math>\sqrt{\bar\alpha_t} > 0</math> always holds.<ref>{{Cite conference|conference=IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)|last1=Lin |first1=Shanchuan |last2=Liu |first2=Bingchen |last3=Li |first3=Jiashi |last4=Yang |first4=Xiao |date=2024 |title=Common Diffusion Noise Schedules and Sample Steps Are Flawed |url=https://openaccess.thecvf.com/content/WACV2024/html/Lin_Common_Diffusion_Noise_Schedules_and_Sample_Steps_Are_Flawed_WACV_2024_paper.html |language=en |pages=5404–5411}}</ref>
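A minimal sketch of the v-prediction bookkeeping implied by these formulas; the network <code>v_model</code> is a hypothetical stand-in:
<syntaxhighlight lang="python">
import numpy as np

def noise(x0, eps, phi):
    """Noising in angle form: x_phi = cos(phi) * x0 + sin(phi) * eps."""
    return np.cos(phi) * x0 + np.sin(phi) * eps

def v_target(x0, eps, phi):
    """Training target: velocity v = cos(phi) * eps - sin(phi) * x0."""
    return np.cos(phi) * eps - np.sin(phi) * x0

def denoise_step(v_model, x_phi, phi, delta):
    """One denoising step: rotate (x, v) back by an angle delta."""
    return np.cos(delta) * x_phi - np.sin(delta) * v_model(x_phi, phi)
</syntaxhighlight>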