Diffusion model: Difference between revisions

Restored revision 1266167541 by Citation bot (talk): Rv github as citation
Citation bot (talk | contribs)
Added bibcode. Removed URL that duplicated identifier. Removed parameters. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 658/1032
 
(45 intermediate revisions by 21 users not shown)
Line 1:
{{Short description|Deep learning algorithm}}{{About|the technique in generative statistical modeling|3=Diffusion (disambiguation)}}
{{distinguish|Diffusion model (physics)}}
{{Machine learning|Artificial neural network}}
 
In [[machine learning]], '''diffusion models''', also known as '''diffusion-based generative models''' or '''score-based generative models''', are a class of [[latent variable model|latent variable]] [[generative model|generative]] models. A diffusion model consists of two major components: the forward diffusion process and the reverse sampling process.<ref name="chang23design">{{cite arXiv |last1=Chang |first1=Ziyi |last2=Koulieris |first2=George Alex |last3=Shum |first3=Hubert P. H. |title=On the Design Fundamentals of Diffusion Models: A Survey |date=2023 |eprint=2306.04542 |class=cs.LG}}</ref> The goal of diffusion models is to learn a [[diffusion process]] for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model represents data as generated by a diffusion process, whereby a new datum performs a [[Wiener process|random walk with drift]] through the space of all possible data.<ref name="song"/> A trained diffusion model can be sampled in many ways, with different efficiency and quality.
 
There are various equivalent formalisms, including [[Markov chain]]s, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.<ref>{{cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |bibcode=2023ITPAM..4510850C |s2cid=252199918}}</ref> They are typically trained using [[Variational Bayesian methods|variational inference]].<ref name="ho" /> The model responsible for denoising is typically called its "[[#Choice of architecture|backbone]]". The backbone may be of any kind, but it is typically a [[U-Net]] or a [[Transformer (deep learning architecture)|transformer]].
 
{{As of|2024}}, diffusion models are mainly used for [[computer vision]] tasks, including [[image denoising]], [[inpainting]], [[super-resolution]], [[text-to-image model|image generation]], and video generation. These typically involve training a neural network to sequentially [[denoise]] images blurred with [[Gaussian noise]].<ref name="song">{{Cite arXiv |last1=Song |first1=Yang |last2=Sohl-Dickstein |first2=Jascha |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben |date=2021-02-10 |title=Score-Based Generative Modeling through Stochastic Differential Equations |class=cs.LG |eprint=2011.13456 }}</ref><ref name="gu">{{cite arXiv |last1=Gu |first1=Shuyang |last2=Chen |first2=Dong |last3=Bao |first3=Jianmin |last4=Wen |first4=Fang |last5=Zhang |first5=Bo |last6=Chen |first6=Dongdong |last7=Yuan |first7=Lu |last8=Guo |first8=Baining |title=Vector Quantized Diffusion Model for Text-to-Image Synthesis |date=2021 |class=cs.CV |eprint=2111.14822}}</ref> The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.
Line 11 ⟶ 10:
Diffusion-based image generators have seen widespread commercial interest, such as [[Stable Diffusion]] and [[DALL-E]]. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.<ref name="dalle2" />
 
Other than computer vision, diffusion models have also found applications in [[natural language processing]]<ref>{{ Cite arXiv |eprint=2410.18514 |last1=Nie |first1=Shen |last2=Zhu |first2=Fengqi |last3=Du |first3=Chao |last4=Pang |first4=Tianyu |last5=Liu |first5=Qian |last6=Zeng |first6=Guangtao |last7=Lin |first7=Min |last8=Li |first8=Chongxuan |title=Scaling up Masked Diffusion Models on Text |date=2024 |class=cs.AI }}</ref><ref>{{ Cite book |last1=Li |first1=Yifan |last2=Zhou |first2=Kun |last3=Zhao |first3=Wayne Xin |last4=Wen |first4=Ji-Rong |chapter=Diffusion Models for Non-autoregressive Text Generation: A Survey |date=August 2023 |pages=6692–6701 |title=Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence |chapter-url=http://dx.doi.org/10.24963/ijcai.2023/750 |location=California |publisher=International Joint Conferences on Artificial Intelligence Organization |doi=10.24963/ijcai.2023/750|arxiv=2303.06574 |isbn=978-1-956792-03-4 }}</ref> such as [[Natural language generation|text generation]]<ref>{{Cite journal |last1=Han |first1=Xiaochuang |last2=Kumar |first2=Sachin |last3=Tsvetkov |first3=Yulia |date=2023 |title=SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control |url=http://dx.doi.org/10.18653/v1/2023.acl-long.647 |journal=Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=11575–11596 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.acl-long.647|arxiv=2210.17432 }}</ref><ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |url=http://dx.doi.org/10.18653/v1/2023.findings-emnlp.606 |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 |pages=9040–9057 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref> and [[Automatic summarization|summarization]],<ref>{{Cite journal |last1=Zhang |first1=Haopeng |last2=Liu |first2=Xiao |last3=Zhang |first3=Jiawei |date=2023 |title=DiffuSum: Generation Enhanced Extractive Summarization with Diffusion |url=http://dx.doi.org/10.18653/v1/2023.findings-acl.828 |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=13089–13100 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.828|arxiv=2305.01735 }}</ref> sound generation,<ref>{{Cite journal |last1=Yang |first1=Dongchao |last2=Yu |first2=Jianwei |last3=Wang |first3=Helin |last4=Wang |first4=Wen |last5=Weng |first5=Chao |last6=Zou |first6=Yuexian |last7=Yu |first7=Dong |date=2023 |title=Diffsound: Discrete Diffusion Model for Text-to-Sound Generation |url=http://dx.doi.org/10.1109/taslp.2023.3268730 |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |volume=31 |pages=1720–1733 |doi=10.1109/taslp.2023.3268730 |issn=2329-9290|arxiv=2207.09983 |bibcode=2023ITASL..31.1720Y }}</ref> and reinforcement learning.<ref>{{cite arXiv |last1=Janner |first1=Michael |title=Planning with Diffusion for Flexible Behavior Synthesis |date=2022-12-20 |eprint=2205.09991 |last2=Du |first2=Yilun |last3=Tenenbaum |first3=Joshua B. |last4=Levine |first4=Sergey|class=cs.LG }}</ref><ref>{{cite arXiv |last1=Chi |first1=Cheng |title=Diffusion Policy: Visuomotor Policy Learning via Action Diffusion |date=2024-03-14 |eprint=2303.04137 |last2=Xu |first2=Zhenjia |last3=Feng |first3=Siyuan |last4=Cousineau |first4=Eric |last5=Du |first5=Yilun |last6=Burchfiel |first6=Benjamin |last7=Tedrake |first7=Russ |last8=Song |first8=Shuran|class=cs.RO }}</ref>
 
== Denoising diffusion model ==
 
=== Non-equilibrium thermodynamics ===
Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from [[non-equilibrium thermodynamics]], especially [[diffusion]].<ref>{{Cite journal |last1=Sohl-Dickstein |first1=Jascha |last2=Weiss |first2=Eric |last3=Maheswaranathan |first3=Niru |last4=Ganguli |first4=Surya |date=2015-06-01 |title=Deep Unsupervised Learning using Nonequilibrium Thermodynamics |url=http://proceedings.mlr.press/v37/sohl-dickstein15.pdf |journal=Proceedings of the 32nd International Conference on Machine Learning |language=en |publisher=PMLR |volume=37 |pages=2256–2265|arxiv=1503.03585 }}</ref>
 
Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a [[Normal distribution|Gaussian distribution]] <math>\mathcal{N}(0, I)</math>. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.
 
The equilibrium distribution is the Gaussian distribution <math>\mathcal{N}(0, I)</math>, with pdf <math>\rho(x) \propto e^{-\frac 12 \|x\|^2}</math>. This is just the [[Maxwell–Boltzmann distribution]] of particles in a potential well <math>V(x) = \frac 12 \|x\|^2</math> at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a [[Brownian motion|Brownian walker]]) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.
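The relaxation towards equilibrium can be illustrated numerically. The following is a minimal one-dimensional sketch, not taken from the cited sources: it assumes an Euler–Maruyama discretisation of <math>dx = -\nabla V(x)\,dt + \sqrt{2\tau}\,dW</math> with temperature <math>\tau</math>, and the function names, step size and particle counts are illustrative choices.

```python
import math
import random

def simulate_well(x_start, temperature, rng, n_steps=500, dt=0.02):
    """Euler-Maruyama steps for dx = -grad V(x) dt + sqrt(2 * temperature) dW,
    with the quadratic potential V(x) = x^2 / 2, so grad V(x) = x."""
    x = x_start
    for _ in range(n_steps):
        drift = -x * dt  # gradient descent down the potential well
        kick = math.sqrt(2.0 * temperature * dt) * rng.gauss(0.0, 1.0)  # Brownian step
        x += drift + kick
    return x

def mean_and_variance(values):
    """Sample mean and (biased) sample variance of a list of floats."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
# Every particle starts far from equilibrium, at x = 5.
hot = [simulate_well(5.0, 1.0, rng) for _ in range(1000)]   # temperature 1
cold = [simulate_well(5.0, 0.0, rng) for _ in range(100)]   # no randomness

hot_mean, hot_var = mean_and_variance(hot)     # close to the N(0, 1) equilibrium
cold_mean, cold_var = mean_and_variance(cold)  # every particle ends near the origin
```

At temperature 1 the cloud settles near zero mean and unit variance, while at temperature 0 all particles follow the same deterministic path to the origin, which is the collapse described above.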
 
=== Denoising Diffusion Probabilistic Model (DDPM) ===
Line 34 ⟶ 33:
* <math>\tilde \sigma_t := \frac{\sigma_{t-1}}{\sigma_{t}}\sqrt{\beta_t}</math>
* <math>\tilde\mu_t(x_t, x_0) :=\frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})x_t +\sqrt{\bar\alpha_{t-1}}(1-\alpha_{t})x_0}{\sigma_{t}^2}</math>
* <math>\mathcal{N}(\mu, \Sigma)</math> is the normal distribution with mean <math>\mu</math> and variance <math>\Sigma</math>, and <math>\mathcal{N}(x | \mu, \Sigma)</math> is the probability density at <math>x</math>.
* A vertical bar denotes [[Conditioning (probability)|conditioning]].
 
A '''forward diffusion process''' starts at some starting point <math>x_0 \sim q</math>, where <math>q</math> is the probability distribution to be learned, then repeatedly adds noise to it by<math display="block">x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} z_t</math>where <math>z_1, ..., z_T</math> are IID samples from <math>\mathcal{N}(0, I)</math>. This is designed so that for any starting distribution of <math>x_0</math>, we have <math>\lim_t x_t|x_0</math> converging to <math>\mathcal{N}(0, I)</math>.
 
The entire diffusion process then satisfies<math display="block">q(x_{0:T}) = q(x_0)q(x_1|x_0) \cdots q(x_T|x_{T-1}) = q(x_0) \mathcal{N}(x_1 | \sqrt{\alpha_1} x_0, \beta_1 I) \cdots \mathcal{N}(x_T | \sqrt{\alpha_T} x_{T-1}, \beta_T I)</math>or<math display="block">\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \| x_t - \sqrt{1-\beta_t}x_{t-1}\|^2 + C</math>where <math>C</math> is a normalization constant and often omitted. In particular, we note that <math>x_{1:T}|x_0</math> is a [[gaussian process]], which affords us considerable freedom in [[Reparameterization trick|reparameterization]]. For example, by standard manipulation with gaussian process, <math display="block">x_{t}|x_0 \sim N\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math><math display="block">x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)</math>In particular, notice that for large <math>t</math>, the variable <math>x_{t}|x_0 \sim N\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math> converges to <math>\mathcal{N}(0, I)</math>. That is, after a long enough diffusion process, we end up with some <math>x_T</math> that is very close to <math>\mathcal{N}(0, I)</math>, with all traces of the original <math>x_0 \sim q</math> gone.
 
For example, since<math display="block">x_{t}|x_0 \sim N\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math>we can sample <math>x_{t}|x_0</math> directly "in one step", instead of going through all the intermediate steps <math>x_1, x_2, ..., x_{t-1}</math>.
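The equivalence between step-by-step noising and one-step sampling can be checked numerically. The sketch below is a one-dimensional illustration under stated assumptions: it takes <math>\alpha_t = 1-\beta_t</math>, <math>\bar\alpha_t = \alpha_1\cdots\alpha_t</math> and <math>\sigma_t = \sqrt{1-\bar\alpha_t}</math>, and it uses a linear schedule for <math>\beta_t</math> that is a hypothetical choice made only for the example; all function names are illustrative.

```python
import math
import random

def linear_betas(T, beta_min=1e-4, beta_max=0.02):
    """Hypothetical linear noise schedule beta_1..beta_T (an assumption of this sketch)."""
    return [beta_min + (beta_max - beta_min) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def step_by_step(x0, betas, t, rng):
    """Apply x_s = sqrt(1 - beta_s) x_{s-1} + sqrt(beta_s) z_s for s = 1..t."""
    x = x0
    for b in betas[:t]:
        x = math.sqrt(1.0 - b) * x + math.sqrt(b) * rng.gauss(0.0, 1.0)
    return x

def one_step(x0, abar, t, rng):
    """Sample x_t | x_0 directly from N(sqrt(alpha_bar_t) x0, 1 - alpha_bar_t)."""
    a = abar[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)

def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
T, t, x0, n = 1000, 200, 2.0, 2000
betas = linear_betas(T)
abar = alpha_bars(betas)
slow = [step_by_step(x0, betas, t, rng) for _ in range(n)]  # t noising steps each
fast = [one_step(x0, abar, t, rng) for _ in range(n)]       # a single draw each
slow_mean, slow_var = mean_and_variance(slow)
fast_mean, fast_var = mean_and_variance(fast)
```

Both sample sets should have matching mean and variance up to Monte Carlo error, and under this schedule <math>\bar\alpha_T</math> is close to zero, so <math>x_T</math> retains almost nothing of <math>x_0</math>.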
Line 68 ⟶ 67:
 
==== Backward diffusion ====
The key idea of DDPM is to use a neural network parametrized by <math>\theta</math>. The network takes in two arguments <math>x_t, t</math>, and outputs a vector <math>\mu_\theta(x_t, t)</math> and a matrix <math>\Sigma_\theta(x_t, t)</math>, such that each step in the forward diffusion process can be approximately undone by <math>x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>. This then gives us a backward diffusion process <math>p_\theta</math> defined by<math display="block">p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math><math display="block">p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>The goal now is to learn the parameters such that <math>p_\theta(x_0)</math> is as close to <math>q(x_0)</math> as possible. To do that, we use [[maximum likelihood estimation]] with variational inference.
 
==== Variational inference ====
The [[Evidence lower bound|ELBO inequality]] states that <math>\ln p_\theta(x_0) \geq E_{x_{1:T}\sim q(\cdot | x_0)}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>, and taking one more expectation, we get<math display="block">E_{x_0 \sim q}[\ln p_\theta(x_0)] \geq E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.
 
Define the loss function<math display="block">L(\theta) := -E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]</math>and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to<ref name=":7">{{Cite web |last=Weng |first=Lilian |date=2021-07-11 |title=What are Diffusion Models? |url=https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ |access-date=2023-09-24 |website=lilianweng.github.io |language=en}}</ref><math display="block">L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)] + E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))] + C</math>where <math>C</math> does not depend on the parameter, and thus can be ignored. Since <math>p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math> also does not depend on the parameter, the term <math>E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))]</math> can also be ignored. This leaves just <math>L(\theta ) = \sum_{t=1}^T L_t</math> with <math>L_t = E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)]</math> to be minimized.
 
==== Noise prediction network ====
Since <math>x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)</math>, this suggests that we should use <math>\mu_\theta(x_t, t) = \tilde \mu_t(x_t, x_0)</math>; however, the network does not have access to <math>x_0</math>, and so it has to estimate it instead. Now, since <math>x_{t}|x_0 \sim N\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I \right)</math>, we may write <math>x_t = \sqrt{\bar\alpha_t} x_{0} + \sigma_t z</math>, where <math>z</math> is some unknown gaussian noise. Now we see that estimating <math>x_0</math> is equivalent to estimating <math>z</math>.
 
Therefore, let the network output a noise vector <math>\epsilon_\theta(x_t, t)</math>, and let it predict<math display="block">\mu_\theta(x_t, t) =\tilde\mu_t\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t, t) \beta_t/\sigma_t}{\sqrt{\alpha_t}}</math>It remains to design <math>\Sigma_\theta(x_t, t)</math>. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value <math>\Sigma_\theta(x_t, t) = \zeta_t^2 I</math>, where either <math>\zeta_t^2 = \beta_t \text{ or } \tilde\sigma_t^2</math> yielded similar performance.
 
With this, the loss simplifies to <math display="block">L_t = \frac{\beta_t^2}{2\alpha_t\sigma_{t}^2\zeta_t^2} E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right] + C</math>which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>resulted in better models.
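A Monte Carlo estimate of the simplified loss can be sketched as follows. This is a one-dimensional illustration, not the reference training code: the linear noise schedule, the uniform sampling of <math>t</math> (which averages <math>L_{simple, t}</math> over timesteps rather than summing it), the point-mass data distribution, and the two stand-in predictors are all assumptions made for the example.

```python
import math
import random

def make_alpha_bars(T, beta_min=1e-4, beta_max=0.02):
    """alpha_bar_1..alpha_bar_T for a hypothetical linear beta schedule."""
    abar, prod = [], 1.0
    for i in range(T):
        beta = beta_min + (beta_max - beta_min) * i / (T - 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar

def simple_loss(eps_model, sample_x0, abar, rng, n_samples=5000):
    """Monte Carlo estimate of E || eps_model(x_t, t) - z ||^2, averaged over t."""
    total = 0.0
    T = len(abar)
    for _ in range(n_samples):
        t = rng.randint(1, T)              # uniform over 1..T inclusive
        x0 = sample_x0(rng)                # draw a data point from q
        z = rng.gauss(0.0, 1.0)            # the noise the network must predict
        a = abar[t - 1]
        sigma_t = math.sqrt(1.0 - a)
        x_t = math.sqrt(a) * x0 + sigma_t * z
        total += (eps_model(x_t, t) - z) ** 2
    return total / n_samples

abar = make_alpha_bars(1000)
C = 1.5  # toy data distribution: every x_0 equals C

def point_mass(rng):
    return C

def zero_model(x_t, t):
    """Stand-in for an untrained network that always predicts zero noise."""
    return 0.0

def oracle_model(x_t, t):
    """Exact noise for point-mass data: z = (x_t - sqrt(alpha_bar_t) C) / sigma_t."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * C) / math.sqrt(1.0 - a)

rng = random.Random(0)
loss_zero = simple_loss(zero_model, point_mass, abar, rng)
loss_oracle = simple_loss(oracle_model, point_mass, abar, rng)
```

The zero predictor incurs a loss near <math>E[z^2] = 1</math>, while the oracle recovers the noise exactly. In practice <code>eps_model</code> would be a neural network whose parameters are updated by stochastic gradient descent on this quantity.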
 
=== Backward diffusion process ===
Line 87 ⟶ 86:
# Compute the noise estimate <math>\epsilon \leftarrow \epsilon_\theta(x_t, t)</math>
# Compute the original data estimate <math>\tilde x_0 \leftarrow (x_t - \sigma_t \epsilon) / \sqrt{\bar \alpha_t} </math>
# Sample the previous data <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>
# Change time <math>t \leftarrow t-1</math>
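The steps above can be sketched as a loop. The following one-dimensional illustration is an assumption-laden sketch rather than a reference implementation: the linear noise schedule is hypothetical, the data distribution is a toy Gaussian <math>\mathcal{N}(3, 0.5^2)</math>, and <code>eps_oracle</code> is a closed-form stand-in for a trained network <math>\epsilon_\theta</math>.

```python
import math
import random

T = 1000
# betas[1..T] follow a hypothetical linear schedule; index 0 is a placeholder.
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar = [1.0]  # alpha_bar_0 = 1, so sigma_0 = 0
for t in range(1, T + 1):
    abar.append(abar[-1] * (1.0 - betas[t]))
sigma = [math.sqrt(1.0 - a) for a in abar]

M, S = 3.0, 0.5  # toy data distribution q = N(M, S^2)

def eps_oracle(x_t, t):
    """Stand-in for eps_theta: the exact conditional mean of the noise for Gaussian data."""
    var_t = abar[t] * S**2 + sigma[t] ** 2
    x0_hat = M + math.sqrt(abar[t]) * S**2 * (x_t - math.sqrt(abar[t]) * M) / var_t
    return (x_t - math.sqrt(abar[t]) * x0_hat) / sigma[t]

def posterior_mean(x_t, x0, t):
    """tilde mu_t(x_t, x_0), the mean of x_{t-1} given x_t and x_0."""
    alpha_t = 1.0 - betas[t]
    return (math.sqrt(alpha_t) * (1.0 - abar[t - 1]) * x_t
            + math.sqrt(abar[t - 1]) * (1.0 - alpha_t) * x0) / sigma[t] ** 2

def sample(rng):
    x = rng.gauss(0.0, 1.0)                     # start from pure noise x_T
    t = T
    while t >= 1:
        eps = eps_oracle(x, t)                                     # noise estimate
        x0_tilde = (x - sigma[t] * eps) / math.sqrt(abar[t])       # data estimate
        sigma_tilde = sigma[t - 1] / sigma[t] * math.sqrt(betas[t])
        x = rng.gauss(posterior_mean(x, x0_tilde, t), sigma_tilde)  # previous data
        t -= 1                                                     # change time
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

With the oracle in place of a trained network, the generated samples should have mean near 3 and standard deviation near 0.5, up to discretisation and Monte Carlo error.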
 
Line 113 ⟶ 112:
 
==== Learning the score function ====
Given a density <math>q</math>, we wish to learn a score function approximation <math>f_\theta \approx \nabla \ln q</math>. This is '''score matching'''.<ref>{{Cite web |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation {{!}} Yang Song |url=https://yang-song.net/blog/2019/ssm/ |access-date=2023-09-24 |website=yang-song.net}}</ref> Typically, score matching is formalized as minimizing the '''Fisher divergence''' <math>E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2]</math>. By expanding the square and integrating by parts, <math display="block">E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2] = E_q[\|f_\theta\|^2 + 2\nabla\cdot f_\theta] + C</math>giving us a loss function, also known as the [[Scoring rule#Hyvärinen scoring rule|Hyvärinen scoring rule]], that can be minimized by stochastic gradient descent.
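A one-dimensional example may make the loss concrete. The sketch below is an illustration under stated assumptions: the data are drawn from <math>\mathcal{N}(0, s^2)</math> with <math>s = 2</math>, the model is the hypothetical one-parameter family <math>f_\theta(x) = -\theta x</math>, and the second term of the loss is the divergence of <math>f_\theta</math>, which in one dimension is the derivative <math>f_\theta'(x) = -\theta</math>. The loss is minimized by plain gradient descent without ever evaluating the true score <math>-x/s^2</math>.

```python
import random

def hyvarinen_loss(theta, xs):
    """Empirical E_q[f^2 + 2 f'] for the model f_theta(x) = -theta * x."""
    return sum((theta * x) ** 2 - 2.0 * theta for x in xs) / len(xs)

def loss_gradient(theta, xs):
    """Derivative of hyvarinen_loss with respect to theta."""
    return sum(2.0 * theta * x * x - 2.0 for x in xs) / len(xs)

rng = random.Random(0)
S = 2.0                                         # standard deviation of the toy data
xs = [rng.gauss(0.0, S) for _ in range(5000)]   # samples from q

theta = 0.0
for _ in range(50):                             # gradient descent on the empirical loss
    theta -= 0.05 * loss_gradient(theta, xs)
```

The fitted <code>theta</code> should be close to <math>1/s^2 = 0.25</math>, so that <math>f_\theta(x) \approx -x/s^2</math>, the score of the data distribution.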
 
==== Annealing the score function ====
Suppose we need to model the distribution of images, and we want <math>x_0 \sim \mathcal{N}(0, I)</math>, a white-noise image. Now, most white-noise images do not look like real images, so <math>q(x_0) \approx 0</math> for large swaths of <math>x_0 \sim \mathcal{N}(0, I)</math>. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function <math>\nabla_{x_t}\ln q(x_t)</math> at that point, then we cannot impose the time-evolution equation on a particle:<math display="block">dx_{t}= \nabla_{x_t}\ln q(x_t) d t+d W_t</math>To deal with this problem, we perform [[Simulated annealing|annealing]]. If <math>q</math> is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.
 
=== Continuous diffusion processes ===
Line 125 ⟶ 124:
Now, the equation is exactly a special case of the [[Brownian dynamics|overdamped Langevin equation]]<math display="block">dx_t = -\frac{D}{k_BT} (\nabla_x U)dt + \sqrt{2D}dW_t</math>where <math>D</math> is diffusion tensor, <math>T</math> is temperature, and <math>U</math> is potential energy field. If we substitute in <math>D= \frac 12 \beta(t)I, k_BT = 1, U = \frac 12 \|x\|^2</math>, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.
 
Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to <math>q</math> at time <math>t=0</math>, then after a long time, the cloud of particles would settle into the stable distribution of <math>\mathcal{N}(0, I)</math>. Let <math>\rho_t</math> be the density of the cloud of particles at time <math>t</math>, then we have<math display="block">\rho_0 = q; \quad \rho_T \approx \mathcal{N}(0, I)</math>and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.
 
By [[Fokker–Planck equation|Fokker-Planck equation]], the density of the cloud evolves according to<math display="block">\partial_t \ln \rho_t = \frac 12 \beta(t) \left(
n + (x+ \nabla\ln\rho_t) \cdot \nabla \ln\rho_t + \Delta\ln\rho_t
\right)</math>where <math>n</math> is the dimension of space, and <math>\Delta</math> is the [[Laplace operator]]. Equivalently,<math display="block">\partial_t \rho_t = \frac 12 \beta(t) ( \nabla\cdot(x\rho_t) + \Delta \rho_t)</math>
 
==== Backward diffusion process ====
If we have solved <math>\rho_t</math> for time <math>t\in [0, T]</math>, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density <math>\nu_0 = \rho_T</math>, and let the particles in the cloud evolve according to

<math display="block">dy_t = \frac{1}{2} \beta(T-t) y_{t} d t + \beta(T-t) \underbrace{\nabla_{y_{t}} \ln \rho_{T-t}\left(y_{t}\right)}_{\text {score function }} d t+\sqrt{\beta(T-t)} d W_t</math>

then by plugging into the Fokker-Planck equation, we find that <math>\partial_t \rho_{T-t} = \partial_t \nu_t</math>. Thus this cloud of points is the original cloud, evolving backwards.<ref>{{Cite journal |last=Anderson |first=Brian D.O. |date=May 1982 |title=Reverse-time diffusion equation models |url=http://dx.doi.org/10.1016/0304-4149(82)90051-5 |journal=Stochastic Processes and Their Applications |volume=12 |issue=3 |pages=313–326 |doi=10.1016/0304-4149(82)90051-5 |issn=0304-4149|url-access=subscription }}</ref>
 
=== Noise conditional score network (NCSN) ===
Line 139 ⟶ 142:
and so
<math display="block">x_{t}|x_0 \sim N\left(e^{-\frac 12\int_0^t \beta(t)dt} x_{0}, \left(1- e^{-\int_0^t \beta(t)dt}\right) I \right)</math>
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling <math>x_0 \sim q, z \sim \mathcal{N}(0, I)</math>, then setting <math>x_t = e^{-\frac 12\int_0^t \beta(t)dt} x_{0} + \sqrt{1- e^{-\int_0^t \beta(t)dt}}\, z</math>. That is, we can quickly sample <math>x_t \sim \rho_t</math> for any <math>t \geq 0</math>.
 
Now, define a certain probability distribution <math>\gamma</math> over <math>[0, \infty)</math>, then the score-matching loss function is defined as the expected Fisher divergence:
<math display="block">L(\theta) = E_{t\sim \gamma, x_t \sim \rho_t}[\|f_\theta(x_t, t)\|^2 + 2\nabla\cdot f_\theta(x_t, t)]</math>
After training, <math>f_\theta(x_t, t) \approx \nabla \ln\rho_t</math>, so we can perform the backwards diffusion process by first sampling <math>x_T \sim \mathcal{N}(0, I)</math>, then integrating the SDE from <math>t=T</math> to <math>t=0</math>:
<math display="block">x_{t-dt}=x_t + \frac{1}{2} \beta(t) x_{t} d t + \beta(t) f_\theta(x_t, t) d t+\sqrt{\beta(t)} d W_t</math>
This may be done by any SDE integration method, such as [[Euler–Maruyama method]].
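As an illustration, the reverse-time integration can be sketched with the Euler–Maruyama method in one dimension. The sketch makes several assumptions that are not part of the text: <math>\beta(t) = 1</math> is constant, the data distribution is a toy Gaussian <math>\mathcal{N}(3, 0.5^2)</math>, and the exact score of <math>\rho_t</math> (available in closed form for Gaussian data) stands in for the trained network <math>f_\theta</math>.

```python
import math
import random

BETA = 1.0        # constant beta(t); an illustrative assumption
T_END = 10.0      # long enough that rho_T is close to N(0, 1)
M, S = 3.0, 0.5   # toy data distribution q = N(M, S^2)

def score(x, t):
    """Exact score of rho_t for Gaussian toy data; stands in for f_theta(x, t)."""
    decay = math.exp(-BETA * t)
    mean_t = math.sqrt(decay) * M
    var_t = decay * S**2 + (1.0 - decay)
    return -(x - mean_t) / var_t

def reverse_sample(rng, n_steps=500):
    """Integrate the reverse SDE from t = T_END down to t = 0."""
    dt = T_END / n_steps
    x = rng.gauss(0.0, 1.0)                       # x_T ~ N(0, 1)
    for k in range(n_steps):
        t = T_END - k * dt
        drift = 0.5 * BETA * x + BETA * score(x, t)
        x = x + drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sample(rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

The resulting samples should approximately follow the toy data distribution. Finer step sizes, or higher-order SDE solvers, reduce the discretisation error.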
Line 159 ⟶ 162:
<math display="block">\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_{t}^2}(-x_t + \sqrt{\bar\alpha_t} E_q[x_0|x_t])</math>
As described previously, the DDPM loss function is <math>\sum_t L_{simple, t}</math> with
<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>
where <math>x_t =\sqrt{\bar\alpha_t} x_{0} + \sigma_tz
</math>. By a change of variables,
Line 167 ⟶ 170:
and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have <math>\epsilon_\theta(x_t, t) = \frac{x_t -\sqrt{\bar\alpha_t} E_q[x_0|x_t]}{\sigma_t} = -\sigma_t\nabla_{x_t}\ln q(x_t)</math>
 
Thus, a score-based network predicts noise, and can be used for denoising diffusion.
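The identity <math>\epsilon_\theta(x_t, t) = -\sigma_t\nabla_{x_t}\ln q(x_t)</math> can be checked in closed form for a toy case. The sketch below assumes one-dimensional Gaussian data <math>x_0 \sim \mathcal{N}(m, s^2)</math>, for which both the posterior mean <math>E_q[x_0|x_t]</math> and the score of <math>q(x_t)</math> are known exactly; the function names are illustrative.

```python
import math

def optimal_eps(x_t, abar_t, m, s):
    """(x_t - sqrt(alpha_bar_t) E[x_0 | x_t]) / sigma_t for data x_0 ~ N(m, s^2)."""
    sigma_t = math.sqrt(1.0 - abar_t)
    var_t = abar_t * s**2 + sigma_t**2            # variance of x_t
    mean_t = math.sqrt(abar_t) * m                # mean of x_t
    x0_post = m + math.sqrt(abar_t) * s**2 * (x_t - mean_t) / var_t
    return (x_t - math.sqrt(abar_t) * x0_post) / sigma_t

def score(x_t, abar_t, m, s):
    """Gradient of ln q(x_t), where q(x_t) = N(sqrt(alpha_bar_t) m, alpha_bar_t s^2 + sigma_t^2)."""
    var_t = abar_t * s**2 + (1.0 - abar_t)
    return -(x_t - math.sqrt(abar_t) * m) / var_t

def eps_from_score(score_value, abar_t):
    """Convert a score estimate into a noise estimate: eps = -sigma_t * score."""
    return -math.sqrt(1.0 - abar_t) * score_value

# Compare the two sides of the identity at a few arbitrary points.
checks = [
    (optimal_eps(x, a, 3.0, 0.5), eps_from_score(score(x, a, 3.0, 0.5), a))
    for x in (-2.0, 0.3, 4.1)
    for a in (0.9, 0.5, 0.05)
]
```

Each pair in <code>checks</code> agrees up to floating-point rounding, so a network trained on either objective can be converted into the other by rescaling with <math>-\sigma_t</math>.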
 
Conversely, the continuous limit <math>x_{t-1} = x_{t-dt}, \beta_t = \beta(t) dt, z_t\sqrt{dt} = dW_t</math> of the backward equation
<math display="block">x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}}- \frac{ \beta_t}{\sigma_{t}\sqrt{\alpha_t }} \epsilon_\theta(x_t, t) + \sqrt{\beta_t} z_t; \quad z_t \sim \mathcal{N}(0, I)</math>
gives us precisely the same equation as score-based diffusion:
<math display="block">x_{t-dt} = x_t(1+\beta(t)dt / 2) + \beta(t) \nabla_{x_t}\ln q(x_t) dt + \sqrt{\beta(t)}dW_t</math>Thus, at infinitesimal steps of DDPM, a denoising network performs score-based diffusion.
 
== Main variants ==
Line 185 ⟶ 188:
 
=== Denoising Diffusion Implicit Model (DDIM) ===
The original DDPM method for generating images is slow, since the forward diffusion process usually takes <math>T \sim 1000</math> steps to make the distribution of <math>x_T</math> appear close to Gaussian. This means that the backward diffusion process also takes about 1000 steps. Unlike the forward diffusion process, which can skip steps as <math>x_t | x_0</math> is Gaussian for all <math>t \geq 1</math>, the backward diffusion process does not allow skipping steps. For example, to sample <math>x_{t-2}|x_{t-1} \sim \mathcal{N}(\mu_\theta(x_{t-1}, t-1), \Sigma_\theta(x_{t-1}, t-1))</math> requires the model to first sample <math>x_{t-1}</math>. Attempting to directly sample <math>x_{t-2}|x_t</math> would require us to marginalize out <math>x_{t-1}</math>, which is generally intractable.
 
DDIM<ref>{{Cite arXiv |last1=Song |first1=Jiaming |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano |date=3 Oct 2023 |title=Denoising Diffusion Implicit Models |class=cs.LG |eprint=2010.02502}}</ref> is a method to take any model trained on the DDPM loss and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If the Markov chain of DDPM is generalized to a non-Markovian process, DDIM corresponds to the case where the reverse process has variance equal to 0. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.
Line 191 ⟶ 194:
In detail, the DDIM sampling method is as follows. Start with the forward diffusion process <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t \epsilon</math>. Then, during the backward denoising process, given <math>x_t, \epsilon_\theta(x_t, t)</math>, the original data is estimated as <math display="block">x_0' = \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{ \sqrt{\bar\alpha_t}}</math>then the backward diffusion process can jump to any step <math>0 \leq s < t</math>, and the next denoised sample is <math display="block">x_{s} = \sqrt{\bar\alpha_{s}} x_0'
+ \sqrt{\sigma_{s}^2 - (\sigma'_s)^2} \epsilon_\theta(x_t, t)
+ \sigma_s' \epsilon</math>where <math>\sigma_s'</math> is an arbitrary real number within the range <math>[0, \sigma_s]</math>, and <math>\epsilon \sim \mathcal{N}(0, I)</math> is a newly sampled Gaussian noise.<ref name=":7" /> If all <math>\sigma_s' = 0</math>, then the backward process becomes deterministic, and this special case is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar, at a high level, to ones generated with 1000 steps.
 
The original paper recommended defining a single "eta value" <math>\eta \in [0, 1]</math>, such that <math>\sigma_s' = \eta \tilde\sigma_s</math>. When <math>\eta = 1</math>, this is the original DDPM. When <math>\eta = 0</math>, this is the fully deterministic DDIM. For intermediate values, the process interpolates between them.
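The DDIM update with an eta value can be sketched as follows. This one-dimensional illustration rests on several assumptions: the linear noise schedule is hypothetical, the data distribution is a toy Gaussian <math>\mathcal{N}(3, 0.5^2)</math> with a closed-form <code>eps_oracle</code> standing in for a trained <math>\epsilon_\theta</math>, and for a jump from step <math>t</math> to step <math>s</math> the noise level is taken to be <math>\sigma_s' = \eta \frac{\sigma_s}{\sigma_t}\sqrt{1 - \bar\alpha_t/\bar\alpha_s}</math>, which reduces to <math>\eta\tilde\sigma_t</math> when <math>s = t-1</math>.

```python
import math
import random

T = 1000
abar = [1.0]                                   # abar[0] = 1: no noise at step 0
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)  # hypothetical linear schedule
    abar.append(abar[-1] * (1.0 - beta))

M, S = 3.0, 0.5                                # toy data distribution N(M, S^2)

def eps_oracle(x_t, t):
    """Stand-in for a trained eps_theta: exact for Gaussian toy data."""
    var_t = abar[t] * S**2 + (1.0 - abar[t])
    return math.sqrt(1.0 - abar[t]) * (x_t - math.sqrt(abar[t]) * M) / var_t

def ddim_step(x_t, t, s, eta, rng):
    """Jump from step t to an earlier step s, with 0 <= s < t."""
    sigma_t, sigma_s = math.sqrt(1.0 - abar[t]), math.sqrt(1.0 - abar[s])
    eps = eps_oracle(x_t, t)
    x0_pred = (x_t - sigma_t * eps) / math.sqrt(abar[t])   # estimate of the original data
    # Assumed generalisation of eta * tilde sigma to a jump over several steps.
    sigma_prime = eta * (sigma_s / sigma_t) * math.sqrt(1.0 - abar[t] / abar[s])
    direction = math.sqrt(max(sigma_s**2 - sigma_prime**2, 0.0)) * eps
    return math.sqrt(abar[s]) * x0_pred + direction + sigma_prime * rng.gauss(0.0, 1.0)

def ddim_sample(x_T, n_steps, eta, rng):
    """Denoise x_T using n_steps evenly spaced jumps T -> ... -> 0."""
    schedule = [T * k // n_steps for k in range(n_steps, -1, -1)]
    x = x_T
    for t, s in zip(schedule[:-1], schedule[1:]):
        x = ddim_step(x, t, s, eta, rng)
    return x

rng = random.Random(0)
few = ddim_sample(1.0, 20, 0.0, rng)     # deterministic, 20 steps
many = ddim_sample(1.0, 1000, 0.0, rng)  # deterministic, 1000 steps
samples = [ddim_sample(rng.gauss(0.0, 1.0), 20, 0.0, rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
```

With <code>eta = 0.0</code> the output depends only on the starting noise, and in this toy setting the 20-step and 1000-step results from the same starting point are close to each other.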
Line 199 ⟶ 202:
=== Latent diffusion model (LDM) ===
{{Main|Latent diffusion model}}
 
Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.<ref name=":2">{{Cite arXiv|last1=Rombach |first1=Robin |last2=Blattmann |first2=Andreas |last3=Lorenz |first3=Dominik |last4=Esser |first4=Patrick |last5=Ommer |first5=Björn |date=13 April 2022 |title=High-Resolution Image Synthesis With Latent Diffusion Models |class=cs.CV |eprint=2112.10752 }}</ref>
 
Line 204 ⟶ 208:
 
=== Architectural improvements ===
Nichol and Dhariwal<ref>{{Cite journal |last1=Nichol |first1=Alexander Quinn |last2=Dhariwal |first2=Prafulla |date=2021-07-01 |title=Improved Denoising Diffusion Probabilistic Models |url=https://proceedings.mlr.press/v139/nichol21a.html |journal=Proceedings of the 38th International Conference on Machine Learning |language=en |publisher=PMLR |pages=8162–8171}}</ref> proposed various architectural improvements. For example, they proposed log-space interpolation of the variance during backward sampling. Instead of sampling from <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>, they recommended sampling from <math>\mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \beta_t^{v} (\tilde\sigma_t^2)^{1-v} I)</math> for a learned parameter <math>v</math>, which interpolates in log space between the two fixed variances <math>\beta_t</math> and <math>\tilde\sigma_t^2</math>.
 
In the ''v-prediction'' formalism, the noising formula <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon_t</math> is reparameterised by an angle <math>\phi_t</math> such that <math>\cos \phi_t = \sqrt{\bar\alpha_t}</math> and a "velocity" defined by <math>\cos\phi_t \epsilon_t - \sin\phi_t x_0</math>. The network is trained to predict the velocity <math>\hat v_\theta</math>, and denoising is by <math>x_{\phi_t - \delta} = \cos(\delta)\; x_{\phi_t} - \sin(\delta) \hat{v}_{\theta}\; (x_{\phi_t}) </math>.<ref>{{Cite conference|conference=The Tenth International Conference on Learning Representations (ICLR 2022)|last1=Salimans|first1=Tim|last2=Ho|first2=Jonathan|date=2021-10-06|title=Progressive Distillation for Fast Sampling of Diffusion Models|url=https://openreview.net/forum?id=TIdIXIpzhoI|language=en}}</ref> This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. <math>\phi_t = 90^\circ</math>) and then reverse it, whereas the standard parameterization never reaches total noise since <math>\sqrt{\bar\alpha_t} > 0</math> is always true.<ref>{{Cite conference|conference=IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)|last1=Lin |first1=Shanchuan |last2=Liu |first2=Bingchen |last3=Li |first3=Jiashi |last4=Yang |first4=Xiao |date=2024 |title=Common Diffusion Noise Schedules and Sample Steps Are Flawed |url=https://openaccess.thecvf.com/content/WACV2024/html/Lin_Common_Diffusion_Noise_Schedules_and_Sample_Steps_Are_Flawed_WACV_2024_paper.html |language=en |pages=5404–5411}}</ref>
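The rotation picture behind ''v-prediction'' can be verified directly. The sketch below is a one-dimensional illustration with hypothetical function names; it assumes the network predicts the velocity exactly, in which case one denoising step by an angle <math>\delta</math> lands on the noised sample at angle <math>\phi_t - \delta</math>, and a step with <math>\delta = \phi_t</math> recovers <math>x_0</math>.

```python
import math

def angle(abar_t):
    """Angle phi_t with cos(phi_t) = sqrt(alpha_bar_t)."""
    return math.acos(math.sqrt(abar_t))

def noised(x0, eps, phi):
    """x_phi = cos(phi) x_0 + sin(phi) eps, the noising formula in angular form."""
    return math.cos(phi) * x0 + math.sin(phi) * eps

def velocity(x0, eps, phi):
    """The training target v = cos(phi) eps - sin(phi) x_0."""
    return math.cos(phi) * eps - math.sin(phi) * x0

def denoise_step(x_phi, v_hat, delta):
    """Rotate back by delta: x_{phi - delta} = cos(delta) x_phi - sin(delta) v_hat."""
    return math.cos(delta) * x_phi - math.sin(delta) * v_hat

x0, eps = 1.7, -0.4        # arbitrary data point and noise draw
phi = angle(0.3)           # noise level with alpha_bar_t = 0.3
delta = 0.25               # size of one denoising step, in radians

x_phi = noised(x0, eps, phi)
v_true = velocity(x0, eps, phi)            # a perfect network would output this
stepped = denoise_step(x_phi, v_true, delta)
recovered = denoise_step(x_phi, v_true, phi)
```

Here <code>stepped</code> equals <code>noised(x0, eps, phi - delta)</code> and <code>recovered</code> equals <code>x0</code>, up to rounding. At <math>\phi_t = 90^\circ</math> the sample is pure noise while the velocity equals <math>-x_0</math>, so the training target remains informative at total noise.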
Line 261 ⟶ 265:
 
=== Other examples ===
Notable variants include<ref>{{Cite journal |last1=Cao |first1=Hanqun |last2=Tan |first2=Cheng |last3=Gao |first3=Zhangyang |last4=Xu |first4=Yilun |last5=Chen |first5=Guangyong |last6=Heng |first6=Pheng-Ann |last7=Li |first7=Stan Z. |date=July 2024 |title=A Survey on Generative Diffusion Models |url=https://ieeexplore.ieee.org/document/10419041 |journal=IEEE Transactions on Knowledge and Data Engineering |volume=36 |issue=7 |pages=2814–2830 |doi=10.1109/TKDE.2024.3361474 |bibcode=2024ITKDE..36.2814C |issn=1041-4347}}</ref> Poisson flow generative model,<ref>{{Cite journal |last1=Xu |first1=Yilun |last2=Liu |first2=Ziming |last3=Tian |first3=Yonglong |last4=Tong |first4=Shangyuan |last5=Tegmark |first5=Max |last6=Jaakkola |first6=Tommi |date=2023-07-03 |title=PFGM++: Unlocking the Potential of Physics-Inspired Generative Models |url=https://proceedings.mlr.press/v202/xu23m.html |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=38566–38591|arxiv=2302.04265 }}</ref> consistency model,<ref>{{Cite journal |last1=Song |first1=Yang |last2=Dhariwal |first2=Prafulla |last3=Chen |first3=Mark |last4=Sutskever |first4=Ilya |date=2023-07-03 |title=Consistency Models |url=https://proceedings.mlr.press/v202/song23a |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=32211–32252}}</ref> critically damped Langevin diffusion,<ref>{{Cite arXiv |last1=Dockhorn |first1=Tim |last2=Vahdat |first2=Arash |last3=Kreis |first3=Karsten |date=2021-10-06 |title=Score-Based Generative Modeling with Critically-Damped Langevin Diffusion |class=stat.ML |eprint=2112.07068 }}</ref> GenPhys,<ref>{{cite arXiv |last1=Liu |first1=Ziming |title=GenPhys: From Physical Processes to Generative Models |date=2023-04-05 |eprint=2304.02637 |last2=Luo |first2=Di |last3=Xu |first3=Yilun |last4=Jaakkola |first4=Tommi |last5=Tegmark |first5=Max|class=cs.LG }}</ref> cold 
diffusion,<ref>{{Cite journal |last1=Bansal |first1=Arpit |last2=Borgnia |first2=Eitan |last3=Chu |first3=Hong-Min |last4=Li |first4=Jie |last5=Kazemi |first5=Hamid |last6=Huang |first6=Furong |last7=Goldblum |first7=Micah |last8=Geiping |first8=Jonas |last9=Goldstein |first9=Tom |date=2023-12-15 |title=Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/80fe51a7d8d0c73ff7439c2a2554ed53-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=41259–41282|arxiv=2208.09392 }}</ref> discrete diffusion,<ref>{{Cite journal |last1=Gulrajani |first1=Ishaan |last2=Hashimoto |first2=Tatsunori B. |date=2023-12-15 |title=Likelihood-Based Diffusion Language Models |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/35b5c175e139bff5f22a5361270fce87-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=16693–16715|arxiv=2305.18619 }}</ref><ref>{{cite arXiv |last1=Lou |first1=Aaron |title=Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution |date=2024-06-06 |eprint=2310.16834 |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano|class=stat.ML }}</ref> etc.
 
== Flow-based diffusion model ==
 
=== Optimal transport flow ===
The idea of '''optimal transport flow'''<ref>{{Cite journal |last1=Tong |first1=Alexander |last2=Fatras |first2=Kilian |last3=Malkin |first3=Nikolay |last4=Huguet |first4=Guillaume |last5=Zhang |first5=Yanlei |last6=Rector-Brooks |first6=Jarrid |last7=Wolf |first7=Guy |last8=Bengio |first8=Yoshua |date=2023-11-08 |title=Improving and generalizing flow-based generative models with minibatch optimal transport |url=https://openreview.net/forum?id=CD9Snc73AW |journal=Transactions on Machine Learning Research |arxiv=2302.00482 |language=en |issn=2835-8856}}</ref> is to construct a probability path minimizing the [[Wasserstein metric]]. The conditioning distribution is an approximation of the optimal transport plan between <math>\pi_0</math> and <math>\pi_1</math>: <math>z = (x_0, x_1)</math> and <math>q(z) = \Gamma(\pi_0, \pi_1)</math>, where <math>\Gamma</math> is the optimal transport plan, which can be approximated by '''mini-batch optimal transport'''. If the batch size is small, the transport plan computed on a mini-batch can be very far from the true optimal transport plan.
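
A minimal sketch of the mini-batch approximation may help make this concrete. It assumes two equally sized batches with uniform weights and a squared Euclidean cost, in which case the optimal plan within the batch is a permutation that can be found with SciPy's <code>linear_sum_assignment</code>. The function name <code>minibatch_ot_pairs</code> and the toy Gaussian data are illustrative assumptions, not taken from the cited work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def minibatch_ot_pairs(x0, x1):
    """Re-pair two equally sized batches by exact optimal transport.

    Assumes uniform weights and squared Euclidean cost (illustrative choices).
    """
    # cost[i, j] = squared distance between x0[i] and x1[j]
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=-1)
    # For uniform weights the optimal plan is a permutation (assignment problem).
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols]


rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 2))          # toy samples standing in for pi_0
x1 = rng.normal(size=(64, 2)) + 4.0    # toy samples standing in for pi_1

p0, p1 = minibatch_ot_pairs(x0, x1)

# Average transport cost of the random (independent) pairing vs. the OT pairing.
independent_cost = ((x0 - x1) ** 2).sum(axis=1).mean()
ot_cost = ((p0 - p1) ** 2).sum(axis=1).mean()
```

The re-paired batch <code>(p0, p1)</code> can then be used in place of independently drawn pairs. Because the plan is computed only within a batch, it is an approximation whose quality depends on the batch size, as noted above.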
 
=== Rectified flow ===
<math display="block">\min_{\theta} \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left [\lVert{(x_1-x_0) - v_t(x_t)}\rVert^2\right] \,\mathrm{d}t.</math>
 
The data pair <math>(x_0, x_1)</math> can be any coupling of <math>\pi_0</math> and <math>\pi_1</math>, typically the independent coupling (i.e., <math>(x_0,x_1) \sim \pi_0 \times \pi_1</math>), obtained by randomly pairing observations from <math>\pi_0</math> and <math>\pi_1</math>. This process ensures that the trajectories closely mirror the density map of the <math>x_t</math> trajectories but ''reroute'' at intersections to ensure causality. This rectifying process is also known as Flow Matching,<ref>{{cite arXiv |last1=Lipman |first1=Yaron |title=Flow Matching for Generative Modeling |date=2023-02-08 |eprint=2210.02747 |last2=Chen |first2=Ricky T. Q. |last3=Ben-Hamu |first3=Heli |last4=Nickel |first4=Maximilian |last5=Le |first5=Matt|class=cs.LG }}</ref> Stochastic Interpolation,<ref>{{cite arXiv |last1=Albergo |first1=Michael S. |title=Building Normalizing Flows with Stochastic Interpolants |date=2023-03-09 |eprint=2209.15571 |last2=Vanden-Eijnden |first2=Eric|class=cs.LG }}</ref> and Alpha-Blending.{{Citation needed|date=April 2024}}
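
The objective above can be estimated by Monte Carlo. The following is a rough sketch, assuming the linear interpolation <math>x_t = t x_1 + (1-t) x_0</math> and a uniformly sampled <math>t</math>. The function <code>rectified_flow_loss</code>, the toy Gaussian data, and the two hand-written velocity fields are hypothetical stand-ins for a trained network <math>v_t</math>.

```python
import numpy as np


def rectified_flow_loss(v, x0, x1, rng):
    """Monte Carlo estimate of E || (x1 - x0) - v(x_t, t) ||^2.

    `v` is any callable (x_t, t) -> velocity with the same shape as x_t.
    Assumes the linear path x_t = t * x1 + (1 - t) * x0.
    """
    n = x0.shape[0]
    t = rng.uniform(size=(n, 1))           # one time per pair, t ~ U[0, 1]
    xt = t * x1 + (1.0 - t) * x0           # point on the straight line
    target = x1 - x0                       # constant velocity of that line
    residual = target - v(xt, t)
    return float((residual ** 2).sum(axis=1).mean())


rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])
x0 = rng.normal(size=(256, 2))                     # toy samples for pi_0
x1_indep = rng.normal(size=(256, 2)) + shift       # independent coupling

zero_field = lambda x, t: np.zeros_like(x)
const_field = lambda x, t: np.broadcast_to(shift, x.shape)

loss_zero = rectified_flow_loss(zero_field, x0, x1_indep, rng)
loss_const = rectified_flow_loss(const_field, x0, x1_indep, rng)

# With a deterministic coupling x1 = x0 + shift, the constant field is exact.
loss_det = rectified_flow_loss(const_field, x0, x0 + shift, rng)
```

In practice <code>v</code> would be a neural network and this estimate would be minimized by stochastic gradient descent. The sketch only shows how the regression target and the interpolated point are formed.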
 
[[File:Reflow Illustration.png|thumb|390px|The reflow process<ref name=":0"/>]]
 
Rectified flow includes a nonlinear extension where the linear interpolation <math>x_t</math> is replaced with any time-differentiable curve that connects <math>x_0</math> and <math>x_1</math>, given by <math>x_t = \alpha_t x_1 + \beta_t x_0</math>. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of <math>\alpha_t</math> and <math>\beta_t</math>. However, when the path of <math>x_t</math> is not straight, the reflow process no longer ensures a reduction in convex transport costs, and no longer straightens the paths of <math>\phi_t</math>.<ref name=":0" />
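
As a worked illustration of the nonlinear extension, assume the boundary conditions <math>\alpha_0 = 0</math>, <math>\beta_0 = 1</math>, <math>\alpha_1 = 1</math>, <math>\beta_1 = 0</math>, so that the curve starts at <math>x_0</math> and ends at <math>x_1</math>. These conditions are implied by the requirement that the curve connects the two points but are stated here as an assumption. Differentiating the interpolation in time gives the regression target that takes the place of <math>x_1 - x_0</math>:

<math display="block">\dot{x}_t = \dot{\alpha}_t x_1 + \dot{\beta}_t x_0, \qquad \min_{\theta} \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left[\lVert (\dot{\alpha}_t x_1 + \dot{\beta}_t x_0) - v_t(x_t) \rVert^2\right] \,\mathrm{d}t.</math>

The linear choice <math>\alpha_t = t</math>, <math>\beta_t = 1 - t</math> gives <math>\dot{x}_t = x_1 - x_0</math>, which recovers the straight-path objective.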
 
A tutorial on flow matching, with animations, is available on the Cambridge MLG Blog.<ref>{{Cite web |title=An introduction to Flow Matching · Cambridge MLG Blog |url=https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html |access-date=2024-08-20 |website=mlg.eng.cam.ac.uk}}</ref>
 
== Choice of architecture ==
Muse (2023-01)<ref>{{cite arXiv |last1=Chang |first1=Huiwen |title=Muse: Text-To-Image Generation via Masked Generative Transformers |date=2023-01-02 |eprint=2301.00704 |last2=Zhang |first2=Han |last3=Barber |first3=Jarred |last4=Maschinot |first4=A. J. |last5=Lezama |first5=Jose |last6=Jiang |first6=Lu |last7=Yang |first7=Ming-Hsuan |last8=Murphy |first8=Kevin |last9=Freeman |first9=William T.|class=cs.CV }}</ref> is not a diffusion model, but an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens.
 
Imagen 2 (2023-12) is also diffusion-based. It can generate images based on a prompt that mixes images and text. No further information is available.<ref>{{Cite web |title=Imagen 2 - our most advanced text-to-image technology |url=https://deepmind.google/technologies/imagen-2/ |access-date=2024-04-04 |website=Google DeepMind |language=en}}</ref> Imagen 3 (2024-05) is also diffusion-based. No further information is available.<ref>{{Citation |last1=Imagen-Team-Google |title=Imagen 3 |date=2024-12-13 |url=https://arxiv.org/abs/2408.07009 |access-date=2024-12-23 |arxiv=2408.07009 |last2=Baldridge |first2=Jason |last3=Bauer |first3=Jakob |last4=Bhutani |first4=Mukul |last5=Brichtova |first5=Nicole |last6=Bunner |first6=Andrew |last7=Castrejon |first7=Lluis |last8=Chan |first8=Kelvin |last9=Chen |first9=Yichang}}</ref>
 
Veo (2024) generates videos by latent diffusion. The diffusion is conditioned on a vector that encodes both a text prompt and an image prompt.<ref>{{Cite web |date=2024-05-14 |title=Veo |url=https://deepmind.google/technologies/veo/ |access-date=2024-05-17 |website=Google DeepMind |language=en}}</ref>
** {{Cite journal |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2023-11-09 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications |url=https://dl.acm.org/doi/abs/10.1145/3626235 |journal=ACM Comput. Surv. |volume=56 |issue=4 |pages=105:1–105:39 |doi=10.1145/3626235 |issn=0360-0300|arxiv=2209.00796 }}
** {{ Cite arXiv | eprint=2107.03006 | last1=Austin | first1=Jacob | last2=Johnson | first2=Daniel D. | last3=Ho | first3=Jonathan | last4=Tarlow | first4=Daniel | author5=Rianne van den Berg | title=Structured Denoising Diffusion Models in Discrete State-Spaces | date=2021 | class=cs.LG }}
** {{Cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023-09-01 |title=Diffusion Models in Vision: A Survey |url=https://ieeexplore.ieee.org/document/10081412 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |issn=0162-8828|arxiv=2209.04747 |bibcode=2023ITPAM..4510850C }}
* Mathematical details omitted in the article.
** {{Cite web |date=2022-09-25 |title=Power of Diffusion Models |url=https://astralord.github.io/posts/power-of-diffusion-models/ |access-date=2023-09-25 |website=AstraBlog |language=en}}
[[Category:Markov models]]
[[Category:Machine learning algorithms]]
__FORCETOC__