{{Short description|Deep learning algorithm}}{{About|the technique in generative statistical modeling|3=Diffusion (disambiguation)}}
{{Machine learning|Artificial neural network}}
In [[machine learning]], '''diffusion models''', also known as '''diffusion-based generative models''' or '''score-based generative models''', are a class of [[latent variable model|latent variable]] [[generative model|generative]] models. A diffusion model consists of two major components: the forward diffusion process, and the reverse sampling process. The goal of diffusion models is to learn a [[diffusion process]] for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a [[Wiener process|random walk with drift]] through the space of all possible data.<ref name="song"/> A trained diffusion model can be sampled in many ways, with different efficiency and quality.
{{As of|2024}}, diffusion models are mainly used for [[computer vision]] tasks, including [[image denoising]], [[inpainting]], [[super-resolution]], [[text-to-image model|image generation]], and video generation. These typically involve training a neural network to sequentially [[denoise]] images blurred with [[Gaussian noise]].<ref name="song">{{Cite arXiv |last1=Song |first1=Yang |last2=Sohl-Dickstein |first2=Jascha |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben |date=2021-02-10 |title=Score-Based Generative Modeling through Stochastic Differential Equations |class=cs.LG |eprint=2011.13456 }}</ref><ref name="gu">{{cite arXiv |last1=Gu |first1=Shuyang |last2=Chen |first2=Dong |last3=Bao |first3=Jianmin |last4=Wen |first4=Fang |last5=Zhang |first5=Bo |last6=Chen |first6=Dongdong |last7=Yuan |first7=Lu |last8=Guo |first8=Baining |title=Vector Quantized Diffusion Model for Text-to-Image Synthesis |date=2021 |class=cs.CV |eprint=2111.14822}}</ref> The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.
Diffusion-based image generators have seen widespread commercial interest, such as [[Stable Diffusion]] and [[DALL-E]]. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.<ref name="dalle2" />
Other than computer vision, diffusion models have also found applications in [[natural language processing]]<ref>{{ Cite arXiv |eprint=2410.18514 |last1=Nie |first1=Shen |last2=Zhu |first2=Fengqi |last3=Du |first3=Chao |last4=Pang |first4=Tianyu |last5=Liu |first5=Qian |last6=Zeng |first6=Guangtao |last7=Lin |first7=Min |last8=Li |first8=Chongxuan |title=Scaling up Masked Diffusion Models on Text |date=2024 |class=cs.AI }}</ref><ref>{{ Cite book |last1=Li |first1=Yifan |last2=Zhou |first2=Kun |last3=Zhao |first3=Wayne Xin |last4=Wen |first4=Ji-Rong |chapter=Diffusion Models for Non-autoregressive Text Generation: A Survey |date=August 2023 |pages=6692–6701 |title=Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence |chapter-url=http://dx.doi.org/10.24963/ijcai.2023/750 |___location=California |publisher=International Joint Conferences on Artificial Intelligence Organization |doi=10.24963/ijcai.2023/750|arxiv=2303.06574 |isbn=978-1-956792-03-4 }}</ref> such as [[Natural language generation|text generation]]<ref>{{Cite journal |last1=Han |first1=Xiaochuang |last2=Kumar |first2=Sachin |last3=Tsvetkov |first3=Yulia |date=2023 |title=SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control |url=http://dx.doi.org/10.18653/v1/2023.acl-long.647 |journal=Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=11575–11596 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.acl-long.647|arxiv=2210.17432 }}</ref><ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |url=http://dx.doi.org/10.18653/v1/2023.findings-emnlp.606 |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 
|pages=9040–9057 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref> and [[Automatic summarization|summarization]],<ref>{{Cite journal |last1=Zhang |first1=Haopeng |last2=Liu |first2=Xiao |last3=Zhang |first3=Jiawei |date=2023 |title=DiffuSum: Generation Enhanced Extractive Summarization with Diffusion |url=http://dx.doi.org/10.18653/v1/2023.findings-acl.828 |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=13089–13100 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.828|arxiv=2305.01735 }}</ref> sound generation,<ref>{{Cite journal |last1=Yang |first1=Dongchao |last2=Yu |first2=Jianwei |last3=Wang |first3=Helin |last4=Wang |first4=Wen |last5=Weng |first5=Chao |last6=Zou |first6=Yuexian |last7=Yu |first7=Dong |date=2023 |title=Diffsound: Discrete Diffusion Model for Text-to-Sound Generation |url=http://dx.doi.org/10.1109/taslp.2023.3268730 |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |volume=31 |pages=1720–1733 |doi=10.1109/taslp.2023.3268730 |issn=2329-9290|arxiv=2207.09983 |bibcode=2023ITASL..31.1720Y }}</ref> and reinforcement learning.<ref>{{cite arXiv |last1=Janner |first1=Michael |title=Planning with Diffusion for Flexible Behavior Synthesis |date=2022-12-20 |eprint=2205.09991 |last2=Du |first2=Yilun |last3=Tenenbaum |first3=Joshua B. |last4=Levine |first4=Sergey|class=cs.LG }}</ref><ref>{{cite arXiv |last1=Chi |first1=Cheng |title=Diffusion Policy: Visuomotor Policy Learning via Action Diffusion |date=2024-03-14 |eprint=2303.04137 |last2=Xu |first2=Zhenjia |last3=Feng |first3=Siyuan |last4=Cousineau |first4=Eric |last5=Du |first5=Yilun |last6=Burchfiel |first6=Benjamin |last7=Tedrake |first7=Russ |last8=Song |first8=Shuran|class=cs.RO }}</ref>
== Denoising diffusion model ==
=== Non-equilibrium thermodynamics ===
Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution, using techniques from non-equilibrium thermodynamics, especially diffusion.
Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all possible images, and the set of natural photos forms a "cloud" in that space. By repeatedly adding noise, the cloud diffuses outwards until it becomes indistinguishable from pure noise; a model that learns to undo this diffusion can then generate new samples from the original cloud.
The equilibrium distribution is the Gaussian distribution <math>\mathcal{N}(0, I)</math>, with pdf <math>\rho(x) \propto e^{-\frac 12 \|x\|^2}</math>. This is just the [[Maxwell–Boltzmann distribution]] of particles in a potential well <math>V(x) = \frac 12 \|x\|^2</math> at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a [[Brownian motion|Brownian walker]]) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they will all fall to the origin, collapsing the distribution.
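This relaxation toward equilibrium can be checked numerically. The sketch below is a plain-Python illustration, not from any reference implementation: the particle count, step size, and starting point are arbitrary choices. It moves one-dimensional particles by gradient descent on the potential well plus Brownian noise (one standard discretization convention of the Langevin dynamics), and their empirical distribution ends up close to the standard Gaussian regardless of where they started.

```python
import math
import random

random.seed(0)

h, n_particles, n_steps = 0.01, 1000, 800   # illustrative parameters only

# Start all particles far out of equilibrium, then take biased random steps:
# gradient descent on V(x) = x^2 / 2 plus Brownian noise.
xs = [3.0] * n_particles
for _ in range(n_steps):
    xs = [x - h * x + math.sqrt(2 * h) * random.gauss(0, 1) for x in xs]

mean = sum(xs) / n_particles
var = sum(x * x for x in xs) / n_particles - mean ** 2
```

After enough steps the empirical mean is near 0 and the variance near 1, i.e. the cloud has settled into the equilibrium distribution. Without the noise term, every particle would slide deterministically to the origin.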
=== Denoising Diffusion Probabilistic Model (DDPM) ===
A 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by [[Variational Bayesian methods|variational inference]].<ref name="ho">{{Cite journal |last1=Ho |first1=Jonathan |last2=Jain |first2=Ajay |last3=Abbeel |first3=Pieter |date=2020 |title=Denoising Diffusion Probabilistic Models |url=https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=6840–6851}}</ref><ref>{{Citation |last=Ho |first=Jonathan |title=hojonathanho/diffusion |date=Jun 20, 2020 |url=https://github.com/hojonathanho/diffusion |access-date=2024-09-07}}</ref>
==== Forward diffusion ====
* <math>\alpha_t := 1-\beta_t</math>
* <math>\bar \alpha_t := \alpha_1 \cdots \alpha_t</math>
* <math>\sigma_t := \sqrt{1 - \bar\alpha_t}</math>
* <math>\tilde\sigma_t := \frac{\sigma_{t-1}}{\sigma_t}\sqrt{\beta_t}</math>
* <math>\tilde\mu_t(x_t, x_0) :=\frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})x_t +\sqrt{\bar\alpha_{t-1}}(1-\alpha_{t})x_0}{\sigma_{t}^2}</math>
* <math>\mathcal{N}(\mu, \Sigma)</math> is the normal distribution with mean <math>\mu</math> and variance <math>\Sigma</math>, and <math>\mathcal{N}(x | \mu, \Sigma)</math> is the probability density at <math>x</math>.
* A vertical bar denotes [[Conditioning (probability)|conditioning]].
A '''forward diffusion process''' starts at some starting point <math>x_0 \sim q</math>, where <math>q</math> is the probability distribution to be learned, then repeatedly adds noise to it by<math display="block">x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} z_t</math>where <math>z_1, ..., z_T</math> are IID samples from <math>\mathcal{N}(0, I)</math>. This is designed so that for any starting distribution of <math>x_0</math>, we have <math>\lim_t x_t|x_0</math> converging to <math>\mathcal{N}(0, I)</math>.
The entire diffusion process then satisfies<math display="block">q(x_{0:T}) = q(x_0)q(x_1|x_0) \cdots q(x_T|x_{T-1}) = q(x_0) \mathcal{N}(x_1 | \sqrt{\alpha_1} x_0, \beta_1 I) \cdots \mathcal{N}(x_T | \sqrt{\alpha_T} x_{T-1}, \beta_T I)</math>or<math display="block">\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \| x_t - \sqrt{1-\beta_t}x_{t-1}\|^2 + C</math>where <math>C</math> is a normalization constant and often omitted. In particular, we note that <math>x_{1:T}|x_0</math> is a [[gaussian process]], which affords us considerable freedom in [[Reparameterization trick|reparameterization]]. For example, by standard manipulation with gaussian process, <math display="block">x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_t^2 I\right)</math><math display="block">x_{t-1} | x_t, x_0 \sim \mathcal{N}\left(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I\right)</math>
For example, since<math display="block">x_{t}|x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_t^2 I\right)</math>we can sample <math>x_t | x_0</math> directly in one step, rather than going through all the intermediate steps <math>x_1, x_2, \ldots, x_{t-1}</math>.
{{Math proof|title=Derivation by reparameterization|proof=
We know <math display="inline">x_{t-1}|x_0</math> is a gaussian, and <math display="inline">x_t|x_{t-1}</math> is another gaussian. We also know that these are independent. Thus we can perform a reparameterization: <math display="block">x_{t-1} = \sqrt{\bar\alpha_{t-1}} x_{0} + \sqrt{1 - \bar\alpha_{t-1}} z</math> <math display="block">x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1-\alpha_t} z'</math> where <math display="inline">z, z'</math> are IID gaussians.
There are 5 variables <math display="inline">x_0, x_{t-1}, x_t, z, z'</math> and two linear equations. The two sources of randomness are <math display="inline">z, z'</math>, which can be reparameterized by rotation, since the IID gaussian distribution is rotationally symmetric.
By plugging in the equations, we can solve for the first reparameterization: <math display="block">x_t = \sqrt{\bar \alpha_t}x_0 + \underbrace{\sqrt{\alpha_t - \bar\alpha_t}z + \sqrt{1-\alpha_t}z'}_{= \sigma_t z''}</math> where <math display="inline">z''</math> is a gaussian with mean zero and variance one.
To find the second one, we complete the rotational matrix: <math display="block">\begin{bmatrix}z'' \\z'''\end{bmatrix} =
\begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & \frac{\sqrt{1-\alpha_t}}{\sigma_t} \\ ? & ?
\end{bmatrix}
\begin{bmatrix} z\\z'\end{bmatrix}</math>
Since rotational matrices are all of the form <math display="inline">\begin{bmatrix} \cos\theta & \sin\theta\\ -\sin\theta & \cos\theta \end{bmatrix}</math>, we know the matrix must be <math display="block">\begin{bmatrix}z'' \\z'''\end{bmatrix} =
\begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & \frac{\sqrt{1-\alpha_t}}{\sigma_t} \\ -\frac{\sqrt{1-\alpha_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t}
\end{bmatrix}
\begin{bmatrix} z\\z'\end{bmatrix}</math> and since the inverse of rotational matrix is its transpose,<br />
<math display="block">\begin{bmatrix}z \\z'\end{bmatrix} =
\begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & -\frac{\sqrt{1-\alpha_t}}{\sigma_t} \\ \frac{\sqrt{1-\alpha_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t}
\end{bmatrix}
\begin{bmatrix} z''\\z'''\end{bmatrix}</math>
Plugging back, and simplifying, we have <math display="block">x_t = \sqrt{\bar\alpha_t}x_0 + \sigma_t z''</math><math display="block">x_{t-1} = \tilde\mu_t(x_t, x_0) + \tilde\sigma_t z'''</math>
}}
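The reparameterization above can be sanity-checked by Monte Carlo: noising in two steps should match the one-step formula <math>x_t | x_0 \sim \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t) I)</math> in mean and variance. A minimal one-dimensional sketch, with arbitrary illustrative schedule values:

```python
import math
import random

random.seed(1)

alpha_bar_prev = 0.9               # \bar\alpha_{t-1}; illustrative value
alpha_t = 0.95                     # \alpha_t; illustrative value
alpha_bar = alpha_bar_prev * alpha_t
x0 = 1.0
N = 200_000

samples = []
for _ in range(N):
    z, zp = random.gauss(0, 1), random.gauss(0, 1)
    # two-step noising: x_0 -> x_{t-1} -> x_t
    x_prev = math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1 - alpha_bar_prev) * z
    samples.append(math.sqrt(alpha_t) * x_prev + math.sqrt(1 - alpha_t) * zp)

mean = sum(samples) / N
var = sum(s * s for s in samples) / N - mean ** 2
# one-step formula predicts mean sqrt(alpha_bar) * x0 and variance 1 - alpha_bar
```

The empirical mean and variance of the two-step samples agree with the closed-form one-step distribution, which is exactly what the rotation argument proves.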
==== Backward diffusion ====
The key idea of DDPM is to use a neural network parametrized by <math>\theta</math>. The network takes in two arguments <math>x_t, t</math>, and outputs a vector <math>\mu_\theta(x_t, t)</math> and a matrix <math>\Sigma_\theta(x_t, t)</math>, such that each step in the forward diffusion process can be approximately undone by <math>x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>. This then gives us a backward diffusion process <math>p_\theta</math> defined by<math display="block">p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math><math display="block">p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))</math>The goal now is to learn the parameters such that <math>p_\theta(x_0)</math> is as close to <math>q(x_0)</math> as possible. To do that, we use [[maximum likelihood estimation]] with variational inference.
==== Variational inference ====
The [[Evidence lower bound|ELBO inequality]] states that <math>\ln p_\theta(x_0) \geq E_{x_{1:T}\sim q(\cdot | x_0)}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>, and taking one more expectation, we get<math display="block">E_{x_0 \sim q}[\ln p_\theta(x_0)] \geq E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)] </math>We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.
Define the loss function<math display="block">L(\theta) := -E_{x_{0:T}\sim q}[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]</math>and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to<ref name=":7">{{Cite web |last=Weng |first=Lilian |date=2021-07-11 |title=What are Diffusion Models? |url=https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ |access-date=2023-09-24 |website=lilianweng.github.io |language=en}}</ref><math display="block">L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)] + E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))] + C</math>where <math>C</math> does not depend on the parameter, and thus can be ignored. Since <math>p_\theta(x_T) = \mathcal{N}(x_T | 0, I)</math> also does not depend on the parameter, the term <math>E_{x_0 \sim q}[D_{KL}(q(x_T|x_0) \| p_\theta(x_T))]</math> can also be ignored. This leaves just <math>L(\theta ) = \sum_{t=1}^T L_t</math> with <math>L_t = E_{x_{t-1}, x_t\sim q}[-\ln p_\theta(x_{t-1} | x_t)]</math> to be minimized.
==== Noise prediction network ====
Since <math>x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)</math>, this suggests that we should use <math>\mu_\theta(x_t, t) = \tilde\mu_t(x_t, x_0)</math>; however, the network does not have access to <math>x_0</math>, and so it has to estimate it instead. Since <math>x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I\right)</math>, we may write <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t z</math>, where <math>z</math> is some unknown gaussian noise. If the network can predict the noise <math>z</math>, then it can estimate <math>x_0</math>.
Therefore, let the network output a noise vector <math>\epsilon_\theta(x_t, t)</math>, and let it predict<math display="block">\mu_\theta(x_t, t) =\tilde\mu_t\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t, t) \beta_t/\sigma_t}{\sqrt{\alpha_t}}</math>It remains to design <math>\Sigma_\theta(x_t, t)</math>. The DDPM paper suggested not learning it, but fixing it at some value <math>\Sigma_\theta(x_t, t) = \zeta_t^2 I</math>, where either <math>\zeta_t^2 = \beta_t</math> or <math>\tilde\sigma_t^2</math> yielded similar performance.
With this, the loss simplifies to <math display="block">L_t = \frac{\beta_t^2}{2\alpha_t \sigma_t^2 \zeta_t^2} E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right] + C</math>which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>resulted in better models.
=== Backward diffusion process ===
After a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows:
# Compute the noise estimate <math>\epsilon \leftarrow \epsilon_\theta(x_t, t)</math>
# Compute the original data estimate <math>\tilde x_0 \leftarrow (x_t - \sigma_t \epsilon) / \sqrt{\bar \alpha_t} </math>
# Sample the previous data <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>
# Change time <math>t \leftarrow t-1</math>
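The loop above can be sketched in one dimension. Since no trained network is available here, the sketch substitutes an idealized noise predictor for standard-Gaussian data, for which the optimal prediction is known in closed form (<math>\epsilon(x_t, t) = \sigma_t x_t</math>); in practice <math>\epsilon_\theta</math> is a trained neural network, and the schedule values below are illustrative:

```python
import math
import random

random.seed(2)

T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alphas = [1 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)
sigmas = [math.sqrt(1 - ab) for ab in alpha_bars]

def eps_star(x, t):
    # Idealized noise predictor for data q = N(0, 1); a trained network in practice.
    return sigmas[t] * x

def sample():
    x = random.gauss(0, 1)                    # start from pure noise x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        eps = eps_star(x, t)                                  # 1. noise estimate
        x0 = (x - sigmas[t] * eps) / math.sqrt(alpha_bars[t])  # 2. data estimate
        ab_prev = alpha_bars[t - 1]
        # 3. posterior mean mu~_t(x_t, x0) and std sigma~_t
        mu = (math.sqrt(alphas[t]) * (1 - ab_prev) * x
              + math.sqrt(ab_prev) * betas[t] * x0) / (sigmas[t] ** 2)
        sig = (sigmas[t - 1] / sigmas[t]) * math.sqrt(betas[t])
        x = mu + sig * random.gauss(0, 1)                     # 4. sample x_{t-1}
    return x

xs = [sample() for _ in range(2000)]
mean = sum(xs) / len(xs)
var = sum(v * v for v in xs) / len(xs) - mean ** 2
```

Since the data distribution in this toy is itself <math>\mathcal{N}(0, 1)</math>, the generated samples should again have mean near 0 and variance near 1.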
== Score-based generative model ==
Score-based generative models are another formulation of diffusion modelling. They are also called noise conditional score networks (NCSN) or score-matching with Langevin dynamics (SMLD).<ref>{{Cite web |title=Generative Modeling by Estimating Gradients of the Data Distribution {{!}} Yang Song |url=https://yang-song.net/blog/2021/score/ |access-date=2023-09-24 |website=yang-song.net}}</ref><ref name=":9">{{Cite journal |last1=Song |first1=Yang |last2=Ermon |first2=Stefano |date=2019 |title=Generative Modeling by Estimating Gradients of the Data Distribution |url=https://proceedings.neurips.cc/paper/2019/hash/3001ef257407d5a371a96dcd947c7d93-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32|arxiv=1907.05600 }}</ref><ref name=":1">{{Cite arXiv |eprint=2011.13456 |class=cs.LG |first1=Yang |last1=Song |first2=Jascha |last2=Sohl-Dickstein |title=Score-Based Generative Modeling through Stochastic Differential Equations |date=2021-02-10 |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben}}</ref><ref>{{Citation |title=ermongroup/ncsn |date=2019 |url=https://github.com/ermongroup/ncsn |access-date=2024-09-07 |publisher=ermongroup}}</ref>
=== Score matching ===
As it turns out, <math>s(x)</math> allows us to sample from <math>q(x)</math> using thermodynamics. Specifically, if we have a potential energy function <math>U(x) = -\ln q(x)</math>, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the [[Boltzmann distribution]] <math>q_U(x) \propto e^{-U(x)/k_B T} = q(x)^{1/k_BT}</math>. At temperature <math>k_BT=1</math>, the Boltzmann distribution is exactly <math>q(x)</math>.
Therefore, to model <math>q(x)</math>, we may start with a particle sampled at any convenient distribution (such as the standard gaussian distribution), then simulate the motion of the particle forwards according to the [[Langevin equation]]
<math display="block">dx_{t}= -\nabla_{x_t}U(x_t) d t+d W_t</math> and the Boltzmann distribution is, [[Fokker–Planck equation#Boltzmann distribution at the thermodynamic equilibrium|by Fokker-Planck equation, the unique thermodynamic equilibrium]]. So no matter what distribution <math>x_0</math> has, the distribution of <math>x_t</math> converges in distribution to <math>q</math> as <math>t\to \infty</math>.
==== Learning the score function ====
Given a density <math>q</math>, we wish to learn a score function approximation <math>f_\theta \approx \nabla \ln q</math>. This is '''score matching'''.<ref>{{Cite web |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation {{!}} Yang Song |url=https://yang-song.net/blog/2019/ssm/ |access-date=2023-09-24 |website=yang-song.net}}</ref> Typically, score matching is formalized as minimizing the '''Fisher divergence''' <math>E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2]</math>. By expanding the integral, and performing an integration by parts, <math display="block">E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2] = E_q[\|f_\theta(x)\|^2 + 2\nabla\cdot f_\theta(x)] + C</math>where <math>C</math> is a constant that does not depend on <math>\theta</math>, giving a loss function that can be minimized by stochastic gradient descent without knowing <math>\nabla \ln q</math>.
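The integration-by-parts identity can be verified by Monte Carlo in one dimension, with <math>q = \mathcal{N}(0, 1)</math> (whose score is <math>-x</math>) and a linear candidate <math>f_\theta(x) = ax</math>; the constant is <math>C = E_q[\|\nabla \ln q\|^2]</math>. A minimal sketch with illustrative values:

```python
import random

random.seed(3)

a = 0.5                      # slope of the linear candidate f(x) = a * x
N = 100_000
xs = [random.gauss(0, 1) for _ in range(N)]

# Fisher divergence: E_q || f(x) - d/dx ln q(x) ||^2, with score -x for N(0, 1)
lhs = sum((a * x + x) ** 2 for x in xs) / N
# Integration-by-parts form: E_q [ ||f||^2 + 2 * div f ], where div f = a here
rhs = sum((a * x) ** 2 + 2 * a for x in xs) / N
# The constant C = E_q || d/dx ln q ||^2
C = sum(x * x for x in xs) / N
```

Note that the right-hand side never references the unknown score <math>-x</math>, which is exactly why this form is usable as a training loss.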
==== Annealing the score function ====
Suppose we need to model the distribution of images, and we want <math>x_0 \sim \mathcal{N}(0, I)</math>, a white-noise image. Now, most white-noise images do not look like real images, so <math>q(x_0) \approx 0</math> for large swaths of <math>x_0 \sim \mathcal{N}(0, I)</math>. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function <math>\nabla_{x_t}\ln q(x_t)</math> at that point, then we cannot impose the time-evolution equation on a particle:<math display="block">dx_{t}= \nabla_{x_t}\ln q(x_t) d t+d W_t</math>To deal with this problem, we perform [[Simulated annealing|annealing]]. If <math>q</math> is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.
=== Continuous diffusion processes ===
Now, the equation is exactly a special case of the [[Brownian dynamics|overdamped Langevin equation]]<math display="block">dx_t = -\frac{D}{k_BT} (\nabla_x U)dt + \sqrt{2D}dW_t</math>where <math>D</math> is diffusion tensor, <math>T</math> is temperature, and <math>U</math> is potential energy field. If we substitute in <math>D= \frac 12 \beta(t)I, k_BT = 1, U = \frac 12 \|x\|^2</math>, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.
Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to <math>q</math> at time <math>t=0</math>, then after a long time, the cloud of particles would settle into the stable distribution of <math>\mathcal{N}(0, I)</math>. Let <math>\rho_t</math> be the density of the cloud of particles at time <math>t</math>, then we have<math display="block">\rho_0 = q; \quad \rho_T \approx \mathcal{N}(0, I)</math>and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.
By [[Fokker–Planck equation|Fokker-Planck equation]], the density of the cloud evolves according to<math display="block">\partial_t \ln \rho_t = \frac 12 \beta(t) \left(
n + (x+ \nabla\ln\rho_t) \cdot \nabla \ln\rho_t + \Delta\ln\rho_t
\right)</math>where <math>n</math> is the dimension of space, and <math>\Delta</math> is the [[Laplace operator]]. Equivalently,<math display="block">\partial_t \rho_t = \frac 12 \beta(t) ( \nabla\cdot(x\rho_t) + \Delta \rho_t)</math>
==== Backward diffusion process ====
If we have solved <math>\rho_t</math> for time <math>t\in [0, T]</math>, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density <math>\nu_0 = \rho_T</math>, and let the particles in the cloud evolve according to
<math display="block">dy_t = \frac{1}{2} \beta(T-t) y_{t} d t + \beta(T-t) \underbrace{\nabla_{y_{t}} \ln \rho_{T-t}\left(y_{t}\right)}_{\text {score function }} d t+\sqrt{\beta(T-t)} d W_t</math> then by plugging into the Fokker-Planck equation, we find that <math>\partial_t \rho_{T-t} = \partial_t \nu_t</math>. Thus this cloud of points is the original cloud, evolving backwards.<ref>{{Cite journal |last=Anderson |first=Brian D.O. |date=May 1982 |title=Reverse-time diffusion equation models |url=http://dx.doi.org/10.1016/0304-4149(82)90051-5 |journal=Stochastic Processes and Their Applications |volume=12 |issue=3 |pages=313–326 |doi=10.1016/0304-4149(82)90051-5 |issn=0304-4149|url-access=subscription }}</ref>
=== Noise conditional score network (NCSN) ===
At the continuous limit,
<math display="block">\bar \alpha_t = (1-\beta_1) \cdots (1-\beta_t) = e^{\sum_i \ln(1-\beta_i)} \to e^{-\int_0^t \beta(t)dt} </math>
and so
<math display="block">x_{t}|x_0 \sim N\left(e^{-\frac 12\int_0^t \beta(t)dt} x_{0}, \left(1- e^{-\int_0^t \beta(t)dt}\right) I \right)</math>
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling <math>x_0 \sim q, z \sim \mathcal{N}(0, I)</math>, then computing <math>x_t = e^{-\frac 12\int_0^t \beta(t)dt} x_{0} + \sqrt{1- e^{-\int_0^t \beta(t)dt}}\; z</math>. That is, we can quickly sample <math>x_t \sim \rho_t</math> for any <math>t \geq 0</math>.
Now, define a certain probability distribution <math>\gamma</math> over <math>[0, \infty)</math>, then the score-matching loss function is defined as the expected Fisher divergence:
<math display="block">L(\theta) = E_{t\sim \gamma, x_t \sim \rho_t}[\|f_\theta(x_t, t)\|^2 + 2\nabla\cdot f_\theta(x_t, t)]</math> After training, <math>f_\theta(x_t, t) \approx \nabla \ln\rho_t</math>, so we can perform the backwards diffusion process by first sampling <math>x_T \sim \mathcal{N}(0, I)</math>, then integrating the SDE from <math>t=T</math> to <math>t=0</math>: <math display="block">x_{t-dt}=x_t + \frac{1}{2} \beta(t) x_{t} d t + \beta(t) f_\theta(x_t, t) d t+\sqrt{\beta(t)} d W_t</math> This may be done by any SDE integration method, such as the [[Euler–Maruyama method]]. The name "noise conditional score network" reflects that the network is a score model conditioned on the noise level of its input.
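A minimal sketch of this backward integration by Euler–Maruyama, in one dimension, with a constant <math>\beta</math> (an illustrative choice) and the analytic score of standard-Gaussian data substituted for the trained network: for <math>q = \mathcal{N}(0, 1)</math>, every <math>\rho_t = \mathcal{N}(0, 1)</math> and the score is <math>-x</math>.

```python
import math
import random

random.seed(4)

beta = 0.5          # constant beta(t); illustrative value
T, dt = 5.0, 0.01

def score(x, t):
    # For q = N(0, 1), rho_t = N(0, 1) at every t, so the score is -x;
    # in practice this would be the trained network f_theta(x, t).
    return -x

def backward_sample():
    x = random.gauss(0, 1)                    # start at x_T ~ N(0, I)
    steps = int(T / dt)
    for i in range(steps):
        t = T - i * dt
        dw = math.sqrt(dt) * random.gauss(0, 1)
        x += (0.5 * beta * x + beta * score(x, t)) * dt + math.sqrt(beta) * dw
    return x

xs = [backward_sample() for _ in range(1000)]
mean = sum(xs) / len(xs)
var = sum(v * v for v in xs) / len(xs) - mean ** 2
```

Because the toy data distribution is itself standard Gaussian, the integrated samples should again be approximately <math>\mathcal{N}(0, 1)</math>, up to discretization error of order <math>dt</math>.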
== Their equivalence ==
DDPM and score-based generative models are equivalent: a network trained to predict the noise can be reinterpreted as predicting the score, and vice versa.
We know that <math>x_{t}|x_0 \sim N\left(\sqrt{\bar\alpha_t} x_{0}, \sigma_{t}^2 I\right)</math>, so by [[Maurice Tweedie#Tweedie's formula|Tweedie's formula]], we have
<math display="block">\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_{t}^2}(-x_t + \sqrt{\bar\alpha_t} E_q[x_0|x_t])</math>
As described previously, the DDPM loss function is <math>\sum_t L_{simple, t}</math> with
<math display="block">L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]</math>
where <math>x_t =\sqrt{\bar\alpha_t} x_{0} + \sigma_tz
</math>. By a change of variables,
<math display="block">L_{simple, t} = E_{x_0, x_t\sim q}\left[ \left\| \epsilon_\theta(x_t, t) -
\frac{x_t -\sqrt{\bar\alpha_t} x_{0}}{\sigma_t} \right\|^2\right] = E_{x_t\sim q, x_0\sim q(\cdot | x_t)}\left[ \left\| \epsilon_\theta(x_t, t) -
\frac{x_t -\sqrt{\bar\alpha_t} x_{0}}{\sigma_t} \right\|^2\right]</math>
and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have <math>\epsilon_\theta(x_t, t) = \frac{x_t -\sqrt{\bar\alpha_t} E_q[x_0|x_t]}{\sigma_t} = -\sigma_t\nabla_{x_t}\ln q(x_t)</math>
Thus, a score-based network predicts noise, and can be used for denoising.
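This equivalence can be checked in one dimension for <math>q = \mathcal{N}(0, 1)</math>, where the score is <math>\nabla_{x_t} \ln q(x_t) = -x_t</math>, so the least-squares-optimal noise predictor should be <math>\epsilon(x_t) = -\sigma_t \nabla_{x_t} \ln q(x_t) = \sigma_t x_t</math>. A sketch with an arbitrary <math>\bar\alpha_t</math>, fitting a linear predictor in closed form:

```python
import math
import random

random.seed(5)

alpha_bar = 0.6                    # hypothetical \bar\alpha_t
sigma = math.sqrt(1 - alpha_bar)   # sigma_t
N = 100_000

num = den = 0.0
for _ in range(N):
    x0 = random.gauss(0, 1)        # data distribution q = N(0, 1)
    z = random.gauss(0, 1)         # forward noise
    xt = math.sqrt(alpha_bar) * x0 + sigma * z
    num += xt * z                  # accumulate least-squares statistics
    den += xt * xt
c = num / den                      # best linear fit of z as c * x_t
```

The fitted slope <math>c</math> comes out close to <math>\sigma_t</math>, i.e. the least-squares noise predictor is <math>\sigma_t x_t = -\sigma_t \nabla_{x_t}\ln q(x_t)</math>, the claimed identity.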
Conversely, the continuous limit of the backward equation<math display="block">x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}}- \frac{ \beta_t}{\sigma_{t}\sqrt{\alpha_t}}\epsilon_\theta(x_t, t) + \sqrt{\beta_t} z_t, \quad z_t \sim \mathcal{N}(0, I)</math>gives us precisely the same equation as score-based diffusion: <math display="block">x_{t-dt} = x_t(1+\beta(t)dt / 2) + \beta(t) \nabla_{x_t}\ln q(x_t) dt + \sqrt{\beta(t)}dW_t</math>Thus, at infinitesimal steps of DDPM, a denoising network performs score-based diffusion.
== Main variants ==
=== Noise schedule ===
[[File:Linear diffusion noise scheduler.svg|thumb|Illustration for a linear diffusion noise schedule. With settings <math>\beta_1 = 10^{-4}, \beta_{1000} = 0.02</math>.]]
In DDPM, the sequence of numbers <math>0 = \sigma_0 < \sigma_1 < \cdots < \sigma_T < 1</math> is called a (discrete time) '''noise schedule'''. In general, consider a strictly increasing function <math>\sigma</math> of type <math>\R \to (0, 1)</math>, such as the [[sigmoid function]]. In that case, a noise schedule is a sequence of real numbers <math>\lambda_1 < \lambda_2 < \cdots < \lambda_T</math>, which defines a sequence of noises <math>\sigma_t := \sigma(\lambda_t)</math>, from which the other quantities are derived, such as <math>\beta_t = 1 - \frac{1 - \sigma_t^2}{1 - \sigma_{t-1}^2}</math>.
In order to use arbitrary noise schedules, instead of training a noise prediction model <math>\epsilon_\theta(x_t, t)</math>, one trains <math>\epsilon_\theta(x_t, \sigma_t)</math>.
Similarly, for the noise conditional score network, instead of training <math>f_\theta(x_t, t)</math>, one trains <math>f_\theta(x_t, \sigma_t)</math>.
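A sketch of computing the derived quantities from a linear schedule, using the settings in the illustration (<math>\beta_1 = 10^{-4}, \beta_{1000} = 0.02</math>):

```python
import math

T = 1000
# Linear schedule with beta_1 = 1e-4 and beta_T = 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

alpha_bars, sigmas = [], []
prod = 1.0
for b in betas:
    prod *= 1.0 - b                        # alpha_bar_t = product of (1 - beta_i)
    alpha_bars.append(prod)
    sigmas.append(math.sqrt(1.0 - prod))   # sigma_t = sqrt(1 - alpha_bar_t)
```

The resulting <math>\sigma_t</math> increase monotonically toward 1, and <math>\bar\alpha_T</math> is nearly 0, so <math>x_T</math> is essentially pure noise.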
=== Denoising Diffusion Implicit Model (DDIM) ===
The original DDPM method for generating images is slow, since the forward diffusion process usually takes <math>T \sim 1000</math> steps to make the distribution of <math>x_T</math> appear close to gaussian, which means the backward diffusion process also takes 1000 steps. Unlike the forward diffusion process, which can skip steps because <math>x_t | x_0</math> is gaussian for all <math>t \geq 1</math>, the backward diffusion process does not allow skipping steps. For example, sampling <math>x_{t-2}|x_{t-1} \sim \mathcal{N}(\mu_\theta(x_{t-1}, t-1), \Sigma_\theta(x_{t-1}, t-1))</math> requires the model to first sample <math>x_{t-1}</math>. Attempting to directly sample <math>x_{t-2}|x_t</math> would require us to marginalize out <math>x_{t-1}</math>, which is generally intractable.
DDIM<ref>{{Cite arXiv |last1=Song |first1=Jiaming |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano |title=Denoising Diffusion Implicit Models |date=2020 |class=cs.LG |eprint=2010.02502}}</ref> is a method to take any model trained on DDPM loss and use it to sample with some steps skipped, sacrificing an adjustable amount of quality.
In detail, the DDIM sampling method is as follows. Start with the forward diffusion process <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t \epsilon</math>. Then, during the backward denoising process, given <math>x_t, \epsilon_\theta(x_t, t)</math>, the original data is estimated as <math display="block">x_0' = \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{ \sqrt{\bar\alpha_t}}</math>then the backward diffusion process can jump to any step <math>0 \leq s < t</math>, and the next denoised sample is <math display="block">x_{s} = \sqrt{\bar\alpha_{s}} x_0'
+ \sqrt{\sigma_{s}^2 - (\sigma'_s)^2} \epsilon_\theta(x_t, t)
+ \sigma_s' \epsilon</math>where <math>\sigma_s'</math> is an arbitrary real number within the range <math>[0, \sigma_s]</math>, and <math>\epsilon \sim \mathcal{N}(0, I)</math> is a newly sampled gaussian noise.<ref name=":7" /> If all <math>\sigma_s' = 0</math>, then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar at a high level to ones generated with 1000 steps.
The original paper recommended defining a single "eta value" <math>\eta \in [0, 1]</math>, such that <math>\sigma_s' = \eta \tilde\sigma_s</math>. When <math>\eta = 1</math>, this is the original DDPM. When <math>\eta = 0</math>, this is the fully deterministic DDIM. For intermediate values, the process interpolates between them.
By the equivalence, the DDIM algorithm also applies for score-based diffusion models.
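A one-dimensional sketch of deterministic DDIM sampling (<math>\sigma_s' = 0</math>), skipping 50 steps at a time. As no trained model is available here, the idealized noise predictor for standard-Gaussian data stands in for <math>\epsilon_\theta</math>, and the schedule is an assumed linear one:

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed schedule
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)
sigmas = [math.sqrt(1 - ab) for ab in alpha_bars]

def eps_star(x, t):
    # Idealized predictor for 1-D data q = N(0, 1); a trained network in practice.
    return sigmas[t] * x

def ddim_step(x, t, s):
    """Deterministic DDIM jump from step t to step s < t (sigma'_s = 0)."""
    eps = eps_star(x, t)
    x0 = (x - sigmas[t] * eps) / math.sqrt(alpha_bars[t])   # estimate of x_0
    return math.sqrt(alpha_bars[s]) * x0 + sigmas[s] * eps

x = 1.7                                    # a fixed pretend sample of x_T
steps = list(range(T - 1, 0, -50)) + [0]   # 999, 949, ..., 49, 0
for t, s in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, s)
```

With the exact score in this Gaussian toy, the deterministic flow is nearly the identity map, so the output stays close to the starting value, slightly contracted by the finite step size.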
=== Latent diffusion model (LDM) ===
{{Main|Latent diffusion model}}
Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.<ref name=":2">{{Cite arXiv|last1=Rombach |first1=Robin |last2=Blattmann |first2=Andreas |last3=Lorenz |first3=Dominik |last4=Esser |first4=Patrick |last5=Ommer |first5=Björn |date=13 April 2022 |title=High-Resolution Image Synthesis With Latent Diffusion Models |class=cs.CV |eprint=2112.10752 }}</ref>
The encoder-decoder pair is most often a [[variational autoencoder]] (VAE).
=== Architectural improvements ===
Nichol and Dhariwal (2021)<ref>{{Cite journal |last1=Nichol |first1=Alexander Quinn |last2=Dhariwal |first2=Prafulla |date=2021-07-01 |title=Improved Denoising Diffusion Probabilistic Models |url=https://proceedings.mlr.press/v139/nichol21a.html |journal=Proceedings of the 38th International Conference on Machine Learning |language=en |publisher=PMLR |pages=8162–8171}}</ref> proposed various architectural improvements. For example, they proposed log-space interpolation during backward sampling. Instead of sampling from <math>x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)</math>, they recommended sampling from <math>\mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), (\sigma_t^v \tilde\sigma_t^{1-v})^2 I)</math> for a learned parameter <math>v</math>.
In the ''v-prediction'' formalism, the noising formula <math>x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon_t</math> is reparameterised by an angle <math>\phi_t</math> such that <math>\cos \phi_t = \sqrt{\bar\alpha_t}</math> and a "velocity" defined by <math>\cos\phi_t \epsilon_t - \sin\phi_t x_0</math>. The network is trained to predict the velocity <math>\hat v_\theta</math>, and denoising is by <math>x_{\phi_t - \delta} = \cos(\delta)\; x_{\phi_t} - \sin(\delta) \hat{v}_{\theta}\; (x_{\phi_t}) </math>.<ref>{{Cite conference|conference=The Tenth International Conference on Learning Representations (ICLR 2022)|last1=Salimans|first1=Tim|last2=Ho|first2=Jonathan|date=2021-10-06|title=Progressive Distillation for Fast Sampling of Diffusion Models|url=https://openreview.net/forum?id=TIdIXIpzhoI|language=en}}</ref> This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. <math>\phi_t = 90^\circ</math>) and then reverse it, whereas the standard parameterization never reaches total noise since <math>\sqrt{\bar\alpha_t} > 0</math> is always true.<ref>{{Cite conference|conference=IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)|last1=Lin |first1=Shanchuan |last2=Liu |first2=Bingchen |last3=Li |first3=Jiashi |last4=Yang |first4=Xiao |date=2024 |title=Common Diffusion Noise Schedules and Sample Steps Are Flawed |url=https://openaccess.thecvf.com/content/WACV2024/html/Lin_Common_Diffusion_Noise_Schedules_and_Sample_Steps_Are_Flawed_WACV_2024_paper.html |language=en |pages=5404–5411}}</ref>
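The v-prediction algebra can be checked directly: since <math>(x_t, v)</math> is a rotation of <math>(x_0, \epsilon_t)</math> by angle <math>\phi_t</math>, both <math>x_0</math> and the noise are recoverable from <math>x_t</math> and a (perfect) velocity prediction by the inverse rotation, and the denoising step rotates by <math>\delta</math>. A scalar sketch with arbitrary illustrative values:

```python
import math

phi = 0.7                      # angle with cos(phi) = sqrt(alpha_bar_t); arbitrary
x0, eps = 1.3, -0.4            # clean datum and noise; arbitrary values

x_t = math.cos(phi) * x0 + math.sin(phi) * eps   # noising formula
v = math.cos(phi) * eps - math.sin(phi) * x0     # velocity target

# Invert the rotation: recover x0 and eps from (x_t, v).
x0_rec = math.cos(phi) * x_t - math.sin(phi) * v
eps_rec = math.sin(phi) * x_t + math.cos(phi) * v

# One denoising step by angle delta, and the value it should equal,
# namely the noised datum at the smaller angle phi - delta.
delta = 0.2
x_prev = math.cos(delta) * x_t - math.sin(delta) * v
expected = math.cos(phi - delta) * x0 + math.sin(phi - delta) * eps
```

Both recoveries and the denoising step are exact trigonometric identities, which is why a single velocity output suffices in this parameterization.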
=== Classifier guidance ===
Classifier guidance was proposed in 2021 to improve class-conditional generation by using a classifier. The original publication used [[Contrastive Language-Image Pre-training|CLIP text encoders]] to improve text-conditional image generation.<ref name=":8" />
Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution <math>p(x|y)</math>, where <math>x</math> ranges over images, and <math>y</math> ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).
Taking the perspective of the [[noisy channel model]], we can understand the process as follows: To generate an image <math>x</math> conditional on description <math>y</math>, we imagine that the requester really had in mind an image <math>x</math>, but the image is passed through a noisy channel and came out garbled, as <math>y</math>. Image generation is then nothing but inferring which <math>x</math> the requester had in mind.
In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in noisy-channel model, we use Bayes theorem to get
<math display="block">p(x|y) \propto p(y|x)p(x) </math> in other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score <math>\nabla \ln p(x) </math> can be replaced by <math display="block">\nabla_x \ln p(x|y) = \underbrace{\nabla_x \ln p(x)}_{\text{score}} + \underbrace{\nabla_x \ln p(y|x)}_{\text{classifier guidance}} </math> where <math>\nabla_x \ln p(x)</math> is the score function, trained as previously described, and <math>\nabla_x \ln p(y|x)</math> is found by using a differentiable image classifier. During the diffusion process, we need to condition on the time, giving<math display="block">\nabla_{x_t} \ln p(x_t|y, t) = \nabla_{x_t} \ln p(y|x_t, t) + \nabla_{x_t} \ln p(x_t|t) </math>Usually, the classifier model does not depend on time, in which case <math>p(y|x_t, t) = p(y|x_t) </math>.
Classifier guidance is defined for the gradient of score function, thus for score-based diffusion network, but as previously noted, score-based diffusion models are equivalent to denoising models by <math>\epsilon_\theta(x_t, t) =
-\sigma_t\nabla_{x_t}\ln p(x_t|t)</math>, and similarly, <math>\epsilon_\theta(x_t, y, t) =
-\sigma_t\nabla_{x_t}\ln p(x_t|y, t)</math>. Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction:<ref name=":8" /><math display="block">\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \underbrace{\sigma_t \nabla_{x_t} \ln p(y|x_t, t)}_{\text{classifier guidance}} </math>
==== With temperature ====
The classifier-guided diffusion model samples from <math>p(x|y)</math>, which is concentrated around the [[Maximum a posteriori estimation|maximum a posteriori estimate]] <math>\arg\max_x p(x|y)</math>. If we want to force the model to move towards the [[Maximum likelihood estimation|maximum likelihood estimate]] <math>\arg\max_x p(y|x)</math>, we can use
<math display="block">p_\gamma(x|y) \propto p(y|x)^\gamma p(x)</math>
where <math>\gamma > 0 </math> is interpretable as ''[[Thermodynamic beta|inverse temperature]]''. In the context of diffusion models, it is usually called the '''guidance scale'''. A high <math>\gamma </math> would force the model to sample from a distribution concentrated around <math>\arg\max_x p(y|x)</math>. This sometimes improves quality of generated images.<ref name=":8">{{Cite arXiv |last1=Dhariwal |first1=Prafulla |last2=Nichol |first2=Alex |date=2021-06-01 |title=Diffusion Models Beat GANs on Image Synthesis |class=cs.LG |eprint=2105.05233 }}</ref>
This gives a modification to the previous equation:<math display="block">\nabla_x \ln p_\gamma(x|y) = \nabla_x \ln p(x) + \gamma \nabla_x \ln p(y|x) </math>For denoising models, it corresponds to<ref name=":5" /><math display="block">\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \gamma \sigma_t \nabla_{x_t} \ln p(y|x_t, t) </math>
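A minimal sketch of the guided noise prediction, using a toy differentiable "classifier" (a unit Gaussian around a class-dependent centre, so its log-probability gradient is analytic; all names are illustrative):

```python
import numpy as np

def grad_log_classifier(x, mu_y):
    # Toy classifier: log p(y|x) = -0.5 * ||x - mu_y||^2 + const,
    # so grad_x log p(y|x) = -(x - mu_y).
    return -(x - mu_y)

def guided_noise(eps_uncond, x_t, mu_y, sigma_t, gamma=1.0):
    # eps_theta(x_t, y, t) = eps_theta(x_t, t) - gamma * sigma_t * grad_x log p(y|x_t)
    return eps_uncond - gamma * sigma_t * grad_log_classifier(x_t, mu_y)
```

Setting the guidance scale <math>\gamma = 0</math> recovers the unconditional prediction; larger <math>\gamma</math> pushes the denoised sample more strongly toward the class centre.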
=== Classifier-free guidance (CFG) ===
If we do not have a classifier <math>p(y|x)</math>, we could still extract one out of the image model itself:<ref name=":5">{{Cite arXiv |last1=Ho |first1=Jonathan |last2=Salimans |first2=Tim |date=2022-07-25 |title=Classifier-Free Diffusion Guidance |class=cs.LG |eprint=2207.12598 }}</ref>
<math display="block">\nabla_x \ln p_\gamma(x|y) = \nabla_x \ln p(x) + \gamma \left( \nabla_x \ln p(x|y) - \nabla_x \ln p(x) \right)</math> Such a model is usually trained by presenting it with both <math>(x, y) </math> and <math>(x, {\rm None}) </math>, allowing it to model both <math>\nabla_x\ln p(x|y) </math> and <math>\nabla_x\ln p(x) </math>. Note that for CFG, the diffusion model cannot be merely a generative model of the entire data distribution <math>\nabla_x \ln p(x) </math>. It must be a conditional generative model <math>\nabla_x \ln p(x | y) </math>. For example, in stable diffusion, the diffusion backbone takes as input a noisy image <math>x_t </math>, a time <math>t </math>, and a conditioning vector <math>y </math> (such as a vector encoding a text prompt), and produces a noise prediction <math>\epsilon_\theta(x_t, y, t) </math>.
For denoising models, it corresponds to<math display="block">\epsilon_\theta(x_t, y, t, \gamma) = \epsilon_\theta(x_t, t) + \gamma (\epsilon_\theta(x_t, y, t) - \epsilon_\theta(x_t, t))</math>As sampled by DDIM, the algorithm can be written as<ref>{{cite arXiv |last1=Chung |first1=Hyungjin |title=CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models |date=2024-06-12 |eprint=2406.08070 |last2=Kim |first2=Jeongsol |last3=Park |first3=Geon Yeong |last4=Nam |first4=Hyelin |last5=Ye |first5=Jong Chul|class=cs.CV }}</ref><math display="block">\begin{aligned}
\epsilon_{\text{uncond}} &\leftarrow \epsilon_\theta(x_t, t) \\
\epsilon_{\text{cond}} &\leftarrow \epsilon_\theta(x_t, t, c) \\
\epsilon_{\text{CFG}} &\leftarrow \epsilon_{\text{uncond}} + \gamma(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})\\
x_0 &\leftarrow (x_t - \sigma_t \epsilon_{\text{CFG}}) / \sqrt{1 - \sigma_t^2}\\
x_s &\leftarrow \sqrt{1 - \sigma_s^2} x_0 + \sqrt{\sigma_s^2 - (\sigma_s')^2} \epsilon_{\text{uncond}} + \sigma_s' \epsilon\\
\end{aligned}</math>A similar technique applies to language model sampling. Also, if the unconditional generation <math>\epsilon_{\text{uncond}} \leftarrow \epsilon_\theta(x_t, t) </math> is replaced by <math>\epsilon_{\text{neg cond}} \leftarrow \epsilon_\theta(x_t, t, c') </math>, then it results in negative prompting, which pushes the generation away from the condition <math>c' </math>.<ref>{{cite arXiv |last1=Sanchez |first1=Guillaume |title=Stay on topic with Classifier-Free Guidance |date=2023-06-30 |eprint=2306.17806 |last2=Fan |first2=Honglu |last3=Spangher |first3=Alexander |last4=Levi |first4=Elad |last5=Ammanamanchi |first5=Pawan Sasanka |last6=Biderman |first6=Stella|class=cs.CL }}</ref><ref>{{cite arXiv |last1=Armandpour |first1=Mohammadreza |title=Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond |date=2023-04-26 |eprint=2304.04968 |last2=Sadeghian |first2=Ali |last3=Zheng |first3=Huangjie |last4=Sadeghian |first4=Amir |last5=Zhou |first5=Mingyuan|class=cs.CV }}</ref>
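The CFG-guided DDIM update written above translates directly into code. The following sketch (variable names follow the equations; the function itself is illustrative) performs one step from noise level <math>\sigma_t</math> to <math>\sigma_s</math>:

```python
import numpy as np

def cfg_ddim_step(x_t, eps_uncond, eps_cond, sigma_t, sigma_s,
                  gamma, sigma_s_prime=0.0, z=None):
    # Guided noise: eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)
    eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)
    # Predicted clean image: x0 = (x_t - sigma_t * eps_cfg) / sqrt(1 - sigma_t^2)
    x0 = (x_t - sigma_t * eps_cfg) / np.sqrt(1.0 - sigma_t ** 2)
    # Step to noise level sigma_s; sigma_s_prime > 0 injects fresh noise z.
    if z is None:
        z = np.zeros_like(x_t)
    return (np.sqrt(1.0 - sigma_s ** 2) * x0
            + np.sqrt(sigma_s ** 2 - sigma_s_prime ** 2) * eps_uncond
            + sigma_s_prime * z)
```

With <math>\gamma = 1</math> and matching conditional and unconditional predictions, the step reduces to plain deterministic DDIM.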
=== Samplers ===
Given a diffusion model, one may regard it as a continuous process and sample from it by integrating an SDE, or regard it as a discrete process and sample from it by iterating the discrete steps. The choice of the "'''noise schedule'''" <math>\beta_t</math> can also affect the quality of samples.
In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.<ref>{{cite arXiv |eprint=2206.00364 |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2022 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications|class=cs.CV }}</ref> One can interpolate between noise and no noise. The amount of noise is denoted <math>\eta</math> ("eta value") in the DDIM paper, with <math>\eta = 0</math> denoting no noise (as in ''deterministic'' DDIM), and <math>\eta = 1</math> denoting full noise (as in DDPM).
A survey and comparison of samplers in the context of image generation is given by Karras et al.<ref>{{ Cite arXiv |eprint=2206.00364v2 |last1=Karras |first1=Tero |last2=Aittala |first2=Miika |last3=Aila |first3=Timo |last4=Laine |first4=Samuli |date=2022 |title=Elucidating the Design Space of Diffusion-Based Generative Models|class=cs.CV }}</ref>
=== Other examples ===
Notable variants include<ref>{{Cite journal |last1=Cao |first1=Hanqun |last2=Tan |first2=Cheng |last3=Gao |first3=Zhangyang |last4=Xu |first4=Yilun |last5=Chen |first5=Guangyong |last6=Heng |first6=Pheng-Ann |last7=Li |first7=Stan Z. |date=July 2024 |title=A Survey on Generative Diffusion Models |journal=IEEE Transactions on Knowledge and Data Engineering |volume=36 |issue=7 |pages=2814–2830 |doi=10.1109/TKDE.2024.3361474 |bibcode=2024ITKDE..36.2814C |issn=1041-4347}}</ref> Poisson flow generative model,<ref>{{Cite journal |last1=Xu |first1=Yilun |last2=Liu |first2=Ziming |last3=Tian |first3=Yonglong |last4=Tong |first4=Shangyuan |last5=Tegmark |first5=Max |last6=Jaakkola |first6=Tommi |date=2023-07-03 |title=PFGM++: Unlocking the Potential of Physics-Inspired Generative Models |url=https://proceedings.mlr.press/v202/xu23m.html |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=38566–38591|arxiv=2302.04265 }}</ref> consistency model,<ref>{{Cite journal |last1=Song |first1=Yang |last2=Dhariwal |first2=Prafulla |last3=Chen |first3=Mark |last4=Sutskever |first4=Ilya |date=2023-07-03 |title=Consistency Models |url=https://proceedings.mlr.press/v202/song23a |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=32211–32252}}</ref> critically damped Langevin diffusion,<ref>{{Cite arXiv |last1=Dockhorn |first1=Tim |last2=Vahdat |first2=Arash |last3=Kreis |first3=Karsten |date=2021-10-06 |title=Score-Based Generative Modeling with Critically-Damped Langevin Diffusion |class=stat.ML |eprint=2112.07068 }}</ref> GenPhys,<ref>{{cite arXiv |last1=Liu |first1=Ziming |title=GenPhys: From Physical Processes to Generative Models |date=2023-04-05 |eprint=2304.02637 |last2=Luo |first2=Di |last3=Xu |first3=Yilun |last4=Jaakkola |first4=Tommi |last5=Tegmark |first5=Max|class=cs.LG }}</ref> cold diffusion,<ref>{{Cite journal |last1=Bansal |first1=Arpit |last2=Borgnia 
|first2=Eitan |last3=Chu |first3=Hong-Min |last4=Li |first4=Jie |last5=Kazemi |first5=Hamid |last6=Huang |first6=Furong |last7=Goldblum |first7=Micah |last8=Geiping |first8=Jonas |last9=Goldstein |first9=Tom |date=2023-12-15 |title=Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/80fe51a7d8d0c73ff7439c2a2554ed53-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=41259–41282|arxiv=2208.09392 }}</ref> discrete diffusion,<ref>{{Cite journal |last1=Gulrajani |first1=Ishaan |last2=Hashimoto |first2=Tatsunori B. |date=2023-12-15 |title=Likelihood-Based Diffusion Language Models |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/35b5c175e139bff5f22a5361270fce87-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=16693–16715|arxiv=2305.18619 }}</ref><ref>{{cite arXiv |last1=Lou |first1=Aaron |title=Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution |date=2024-06-06 |eprint=2310.16834 |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano|class=stat.ML }}</ref> etc.
== Flow-based diffusion model ==
Abstractly speaking, the idea of diffusion models is to take an unknown probability distribution (the distribution of natural-looking images), then progressively convert it to a known probability distribution (the standard Gaussian distribution), by building an absolutely continuous probability path connecting them. The probability path is in fact defined implicitly by the score function <math>\nabla \ln p_t </math>.
In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are [[Stochastic differential equation|SDEs]], though the forward process is integrable in closed-form, so it can be done at no computational cost. The backward process is not integrable in closed-form, so it must be integrated step-by-step by standard SDE solvers, which can be very expensive. The probability path in diffusion models is defined through an [[Itô process]], and one can retrieve the deterministic process by using the probability flow ODE formulation.<ref name="song" />
In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process is also a deterministic flow along the same vector field, but going backwards. Both processes are solutions to [[Ordinary differential equation|ODEs]]. If the vector field is well-behaved, the ODE will also be well-behaved.
Given two distributions <math>\pi_0</math> and <math>\pi_1</math>, a flow-based model is a time-dependent velocity field <math>v_t(x)</math> in <math>[0,1] \times \mathbb R^d </math>, such that if we start by sampling a point <math>x \sim \pi_0</math>, and let it move according to the velocity field:
<math display="block">\frac{d}{dt} \phi_t(x) = v_t(\phi_t(x)) \quad t \in [0,1], \quad \text{starting from }\phi_0(x) = x</math>
we end up with a point <math>x_1 \sim \pi_1</math>. The solution <math>\phi_t</math> of the above ODE defines a probability path <math>p_t = [\phi_t]_{\#} \pi_0 </math> by the [[pushforward measure]] operator. In particular, <math>[\phi_1]_{\#} \pi_0 = \pi_1</math>.
The probability path and the velocity field also satisfy the [[continuity equation]], in the sense of probability distribution:
<math display="block">\partial_t p_t + \nabla \cdot (v_t p_t) = 0</math>
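Sampling from a flow-based model amounts to numerically integrating this ODE from <math>t=0</math> to <math>t=1</math>. A minimal forward-Euler integrator (illustrative, not from the cited sources) looks like:

```python
import numpy as np

def integrate_flow(v, x0, n_steps=1000):
    # Forward-Euler integration of dx/dt = v_t(x) from t = 0 to t = 1.
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(i * dt, x)
    return x
```

For a constant velocity field the integration is exact; for curved fields, higher-order ODE solvers or more steps reduce discretization error.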
To construct a probability path, we start by constructing a conditional probability path <math>p_t(x \vert z)</math> and the corresponding conditional velocity field <math>v_t(x \vert z)</math> on some conditional distribution <math>q(z)</math>. A natural choice is the Gaussian conditional probability path:
<math display="block">p_t(x \vert z) = \mathcal{N} \left( m_t(z), \zeta_t^2 I \right) </math>
The conditional velocity field corresponding to the geodesic path between Gaussian conditional distributions is
<math display="block">v_t(x \vert z) = \frac{\zeta_t'}{\zeta_t} (x - m_t(z)) + m_t'(z)</math>
The probability path and velocity field are then computed by marginalizing
<math>p_t(x) = \int p_t(x \vert z) q(z) dz \qquad \text{ and } \qquad v_t(x) = \mathbb{E}_{q(z)} \left[\frac{v_t(x \vert z) p_t(x \vert z)}{p_t(x)} \right]</math>
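The conditional velocity field can be checked directly: if <math>x_t = m_t(z) + \zeta_t u</math> for a fixed <math>u</math>, then <math>\dot x_t = m_t'(z) + \zeta_t' u = v_t(x_t \vert z)</math>. A short sketch with an example schedule (all names illustrative):

```python
import numpy as np

def cond_velocity(x, t, m, dm, zeta, dzeta):
    # v_t(x|z) = (zeta'_t / zeta_t) * (x - m_t(z)) + m'_t(z)
    return (dzeta(t) / zeta(t)) * (x - m(t)) + dm(t)

# Example schedule: mean moves linearly from 0 to z, std shrinks from 1 to 0.1.
z = np.array([2.0, -1.0])
m = lambda t: t * z
dm = lambda t: z
zeta = lambda t: 1.0 - 0.9 * t
dzeta = lambda t: -0.9
```

Along the analytic trajectory, the conditional velocity matches the trajectory's time derivative exactly.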
=== Optimal transport flow ===
The idea of '''optimal transport flow''' <ref>{{Cite journal |last1=Tong |first1=Alexander |last2=Fatras |first2=Kilian |last3=Malkin |first3=Nikolay |last4=Huguet |first4=Guillaume |last5=Zhang |first5=Yanlei |last6=Rector-Brooks |first6=Jarrid |last7=Wolf |first7=Guy |last8=Bengio |first8=Yoshua |date=2023-11-08 |title=Improving and generalizing flow-based generative models with minibatch optimal transport |url=https://openreview.net/forum?id=CD9Snc73AW |journal=Transactions on Machine Learning Research |arxiv=2302.00482 |language=en |issn=2835-8856}}</ref> is to construct a probability path minimizing the [[Wasserstein metric]]. The distribution on which we condition is an approximation of the optimal transport plan between <math>\pi_0 </math> and <math>\pi_1
</math>: <math>z = (x_0, x_1) </math> and <math>q(z) = \Gamma(\pi_0, \pi_1) </math>, where <math>\Gamma</math> is the optimal transport plan, which can be approximated by '''mini-batch optimal transport.''' If the batch size is not large, then the transport it computes can be very far from the true optimal transport.
=== Rectified flow ===
The idea of '''rectified flow<ref name=":0">{{cite arXiv|last1=Liu |first1=Xingchao |title=Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow |date=2022-09-07 |eprint=2209.03003 |last2=Gong |first2=Chengyue |last3=Liu |first3=Qiang|class=cs.LG }}</ref>'''<ref>{{cite arXiv |last=Liu |first=Qiang |title=Rectified Flow: A Marginal Preserving Approach to Optimal Transport |date=2022-09-29 |class=stat.ML |eprint=2209.14577}}</ref> is to learn a flow model such that the velocity is nearly constant along each flow path. This is beneficial, because we can integrate along such a vector field with very few steps. For example, if an ODE <math>\dot{\phi_t}(x) = v_t(\phi_t(x))</math> follows perfectly straight paths, it simplifies to <math>\phi_t(x) = x_0 + t \cdot v_0(x_0)</math>, allowing for exact solutions in one step. In practice, we cannot reach such perfection, but when the flow field is nearly so, we can take a few large steps instead of many little steps.
{| class="wikitable" style="border: 1px solid #ccc; width: 95%;"
|-
| style="text-align: center;" | [[File:Flow1.gif|160px]] || style="text-align: center;" | [[File:Flow0.gif|160px]] || style="text-align: center;" | [[File:Flow2.gif|160px]]
|-
| style="text-align: center;" | <small>Linear interpolation </small> || style="text-align: center;" | <small>Rectified Flow </small> || style="text-align: center;" | <small>Straightened Rectified Flow [https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html]</small>
|}
The general idea is to start with two distributions <math>\pi_0</math> and <math>\pi_1</math>, then construct a flow field <math>\phi^0 = \{\phi_t: t\in[0,1]\}</math> from it, then repeatedly apply a "reflow" operation to obtain successive flow fields <math>\phi^1, \phi^2, \dots</math>, each straighter than the previous one. When the flow field is straight enough for the application, we stop.
Generally, for any time-differentiable process <math>\phi_t</math>, <math>v_t</math> can be estimated by solving:
<math display="block">\min_{\theta} \int_0^1 \mathbb{E}_{x \sim p_t}\left [\lVert{v_t(x, \theta) - v_t(x)}\rVert^2\right] \,\mathrm{d}t.</math>
Rectified flow injects the strong prior that intermediate trajectories are straight; this achieves both theoretical relevance for optimal transport and computational efficiency, since ODEs with straight paths can be simulated precisely without time discretization.
[[File:Rectified Flow.png|thumb|240px|Transport by rectified flow<ref name=":0"/>]]
Specifically, rectified flow seeks to match an ODE with the marginal distributions of the '''linear interpolation''' between points from distributions <math>\pi_0</math> and <math>\pi_1</math>. Given observations <math>x_0 \sim \pi_0</math> and <math>x_1 \sim \pi_1</math>, the canonical linear interpolation <math>x_t= t x_1 + (1-t)x_0, t\in [0,1]</math> yields a trivial case <math>\dot{x}_t = x_1 - x_0</math>, which cannot be causally simulated without <math>x_1</math>. To address this, <math>x_t</math> is "projected" into a space of causally simulatable ODEs, by minimizing the least squares loss with respect to the direction <math>x_1 - x_0</math>:
<math display="block">\min_{\theta} \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left [\lVert{(x_1-x_0) - v_t(x_t)}\rVert^2\right] \,\mathrm{d}t.</math>
The data pair <math>(x_0, x_1)</math> can be any coupling of <math>\pi_0</math> and <math>\pi_1</math>, typically independent (i.e., <math>(x_0,x_1) \sim \pi_0 \times \pi_1</math>) obtained by randomly combining observations from <math>\pi_0</math> and <math>\pi_1</math>. This process ensures that the trajectories closely mirror the density map of <math>x_t</math> trajectories but ''reroute'' at intersections to ensure causality.
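The least-squares objective above can be estimated by Monte Carlo over data pairs and interpolation times. The following sketch (function and variable names are illustrative) evaluates the loss for a candidate velocity model:

```python
import numpy as np

def rectified_flow_loss(v_model, x0_batch, x1_batch, rng, n_t=16):
    # Monte-Carlo estimate of E[ || (x1 - x0) - v_t(x_t) ||^2 ],
    # with x_t = t * x1 + (1 - t) * x0 the linear interpolation.
    total, count = 0.0, 0
    for x0, x1 in zip(x0_batch, x1_batch):
        for t in rng.uniform(0.0, 1.0, size=n_t):
            x_t = t * x1 + (1.0 - t) * x0
            total += float(np.sum(((x1 - x0) - v_model(t, x_t)) ** 2))
            count += 1
    return total / count
```

When every pair shares the same displacement <math>x_1 - x_0</math>, the constant velocity field equal to that displacement achieves zero loss, reflecting the perfectly straight case.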
[[File:Reflow Illustration.png|thumb|390px|The reflow process<ref name=":0"/>]]
A distinctive aspect of rectified flow is its capability for "'''reflow'''", which straightens the trajectory of ODE paths. Denote the rectified flow <math>\phi^0 = \{\phi_t: t\in[0,1]\}</math> induced from <math>(x_0,x_1)</math> as <math>\phi^0 = \mathsf{Rectflow}((x_0,x_1))</math>. Recursively applying this <math>\mathsf{Rectflow}(\cdot)</math> operator generates a series of rectified flows <math>\phi^{k+1} = \mathsf{Rectflow}((\phi_0^k(x_0), \phi_1^k(x_1)))</math>. This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making <math>\phi^k</math> paths straighter with increasing <math>k</math>.
Rectified flow includes a nonlinear extension where linear interpolation <math>x_t</math> is replaced with any time-differentiable curve that connects <math>x_0</math> and <math>x_1</math>, given by <math>x_t = \alpha_t x_1 + \beta_t x_0</math>. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of <math>\alpha_t</math> and <math>\beta_t</math>. However, in the case where the path of <math>x_t</math> is not straight, the reflow process no longer ensures a reduction in convex transport costs, and it also no longer straightens the paths of <math>\phi_t</math>.<ref name=":0" />
== Choice of architecture ==
[[File:X-Y plot of algorithmically-generated AI art of European-style castle in Japan demonstrating DDIM diffusion steps.png|thumb|304x304px|The denoising process used by Stable Diffusion]]
For generating images by DDPM, we need a neural network that takes a time <math>t</math> and a noisy image <math>x_t</math>, and predicts a noise <math>\epsilon_\theta(x_t, t)</math> from it. Since predicting the noise amounts to predicting the denoised image and then subtracting it from <math>x_t</math>, denoising architectures tend to work well. For example, the [[U-Net]], which was found to be good for denoising images, is often used for denoising diffusion models that generate images.<ref name=":3">{{Cite journal |last1=Ho |first1=Jonathan |last2=Saharia |first2=Chitwan |last3=Chan |first3=William |last4=Fleet |first4=David J. |last5=Norouzi |first5=Mohammad |last6=Salimans |first6=Tim |date=2022-01-01 |title=Cascaded diffusion models for high fidelity image generation |url=https://dl.acm.org/doi/abs/10.5555/3586589.3586636 |journal=The Journal of Machine Learning Research |volume=23 |issue=1 |pages=47:2249–47:2281 |arxiv=2106.15282 |issn=1532-4435}}</ref>
{{Anchor|Diffusion Transformer|DiT}}For DDPM, the underlying architecture ("backbone") does not have to be a U-Net: it just has to predict the noise <math>\epsilon_\theta(x_t, t)</math> from a noisy image and a time. For example, the diffusion transformer (DiT) uses a [[Transformer (deep learning architecture)|Transformer]] as the backbone.
DDPM can be used to model general data distributions, not just natural-looking images. For example, it can be used to model human motion, with a motion represented as a sequence of joint poses.
=== Conditioning ===
The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on [[ImageNet]] would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition, and then sample from the conditional distribution. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector.
Stable Diffusion, for example, imposes conditioning in the form of [[Attention (machine learning)|cross-attention mechanism]], where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors. The conditioning can be selectively applied to only parts of an image, and new kinds of conditionings can be finetuned upon the base model, as used in ControlNet.<ref>{{cite arXiv |last1=Zhang |first1=Lvmin |last2=Rao |first2=Anyi |last3=Agrawala |first3=Maneesh |title=Adding Conditional Control to Text-to-Image Diffusion Models |date=2023 |eprint=2302.05543 |class=cs.CV}}</ref>
As a particularly simple example, consider [[Inpainting|image inpainting]]. The conditions are <math>\tilde x</math>, the reference image, and <math>m</math>, the inpainting [[Mask (computing)#Image masks|mask]], equal to 1 on the known pixels and 0 on the region to be generated. The conditioning is imposed at each step of the backward diffusion process, by first sampling <math>\tilde x_t \sim \mathcal N\left(\sqrt{\bar\alpha_t} \tilde x, \sigma_t^2 I\right)</math>, a noisy version of the reference image, then replacing <math>x_t</math> with <math>(1-m) \odot x_t + m \odot \tilde x_t</math>, where <math>\odot</math> denotes elementwise multiplication.
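This inpainting step can be sketched as follows, assuming the mask is 1 on known pixels and 0 on the region being generated (names are illustrative):

```python
import numpy as np

def inpaint_condition(x_t, ref, mask, alpha_bar_t, sigma_t, rng):
    # Noise the reference image to the current noise level ...
    ref_t = np.sqrt(alpha_bar_t) * ref + sigma_t * rng.standard_normal(ref.shape)
    # ... then keep x_t where mask == 0 (region being generated) and use the
    # noised reference where mask == 1 (known region).
    return (1.0 - mask) * x_t + mask * ref_t
```

This is applied at every backward step, so the known region always agrees with a correctly noised version of the reference while the masked-out region is filled in by the model.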
Conditioning is not limited to just generating images from a specific category, or according to a specific caption (as in text-to-image). For example,<ref name=":4" /> demonstrated generating human motion, conditioned on an audio clip of human walking (allowing syncing motion to a soundtrack), or video of human running, or a text description of human motion, etc. For how conditional diffusion models are mathematically formulated, see a methodological summary in.<ref>{{cite arXiv |last1=Zhao |first1=Zheng |last2=Luo |first2=Ziwei |last3=Sjölund |first3=Jens |last4=Schön |first4=Thomas B. |title=Conditional sampling within generative diffusion models |eprint=2409.09650 |class=stat.ML |date=2024}}</ref>
=== Upscaling ===
As generating an image takes a long time, one can try to generate a small image by a base diffusion model, then upscale it by other models. Upscaling can be done by [[Generative adversarial network|GAN]], [[Transformer (deep learning architecture)|Transformer]], or signal processing methods like [[Lanczos resampling]].
Diffusion models themselves can be used to perform upscaling. Cascading diffusion model stacks multiple diffusion models one after another, in the style of [[StyleGAN#Progressive GAN|Progressive GAN]]. The lowest level is a standard diffusion model that generate 32x32 image, then the image would be upscaled by a diffusion model specifically trained for upscaling, and the process repeats.<ref name=":3" />
In more detail, the diffusion upscaler is trained as follows:<ref name=":3" />
* Sample <math>(x_0, z_0, c)</math>, where <math>x_0</math> is the high-resolution image, <math>z_0</math> is the same image but scaled down to a low-resolution, and <math>c</math> is the conditioning, which can be the caption of the image, the class of the image, etc.
* Sample two white noises <math>\epsilon_x, \epsilon_z</math>, two time-steps <math>t_x, t_z</math>. Compute the noisy versions of the high-resolution and low-resolution images: <math>\begin{cases}
x_{t_x} &= \sqrt{\bar\alpha_{t_x}} x_0 + \sigma_{t_x} \epsilon_x\\
z_{t_z} &= \sqrt{\bar\alpha_{t_z}} z_0 + \sigma_{t_z} \epsilon_z
\end{cases} </math>.
* Train the denoising network to predict <math>\epsilon_x</math> given <math>x_{t_x}, z_{t_z}, t_x, t_z, c</math>. That is, apply gradient descent on <math>\theta</math> on the L2 loss <math>\| \epsilon_\theta(x_{t_x}, z_{t_z}, t_x, t_z, c) - \epsilon_x \|_2^2</math>.
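The construction of the training inputs for the upscaler can be sketched as follows (function name illustrative; the two images are noised independently with their own schedules):

```python
import numpy as np

def make_training_inputs(x0, z0, abar_tx, sigma_tx, abar_tz, sigma_tz, rng):
    # Independently noise the high-resolution image x0 and low-resolution image z0.
    eps_x = rng.standard_normal(x0.shape)
    eps_z = rng.standard_normal(z0.shape)
    x_tx = np.sqrt(abar_tx) * x0 + sigma_tx * eps_x
    z_tz = np.sqrt(abar_tz) * z0 + sigma_tz * eps_z
    # The denoiser is trained to predict eps_x from (x_tx, z_tz, t_x, t_z, c).
    return x_tx, z_tz, eps_x
```

Noising the low-resolution conditioning image as well ("conditioning augmentation") makes the upscaler robust to imperfect outputs from the previous stage of the cascade.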
== Examples ==
=== OpenAI ===
{{Main|DALL-E|Sora (text-to-video model)}}
The DALL-E series by OpenAI consists of text-conditional image generation models, most of them diffusion-based.
The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer architecture that autoregressively generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was the CLIP classifier, which was used by DALL-E to rank generated images according to how close the image fits the text.
GLIDE (2022-03)<ref>{{Cite arXiv |eprint=2112.10741 |class=cs.CV |first1=Alex |last1=Nichol |first2=Prafulla |last2=Dhariwal |title=GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models |date=2022-03-08 |last3=Ramesh |first3=Aditya |last4=Shyam |first4=Pranav |last5=Mishkin |first5=Pamela |last6=McGrew |first6=Bob |last7=Sutskever |first7=Ilya |last8=Chen |first8=Mark}}</ref> is a 3.5-billion-parameter diffusion model, and a small version was released publicly.<ref name="dalle2">{{Citation |title=GLIDE |date=2023-09-22 |url=https://github.com/openai/glide-text2im |access-date=2023-09-24 |publisher=OpenAI}}</ref> Soon after, DALL-E 2 was released (2022-04).<ref>{{Cite arXiv |eprint=2204.06125 |class=cs.CV |first1=Aditya |last1=Ramesh |first2=Prafulla |last2=Dhariwal |title=Hierarchical Text-Conditional Image Generation with CLIP Latents |date=2022-04-12 |last3=Nichol |first3=Alex |last4=Chu |first4=Casey |last5=Chen |first5=Mark}}</ref> DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", the technique which they termed "unCLIP".
The unCLIP method contains 4 models: a CLIP image encoder, a CLIP text encoder, an image decoder, and a "prior" model (which can be a diffusion model, or an autoregressive model). During training, the prior model is trained to convert CLIP image encodings to CLIP text encodings. The image decoder is trained to convert CLIP image encodings back to images. During inference, a text is converted by the CLIP text encoder to a vector, then it is converted by the prior model to an image encoding, then it is converted by the image decoder to an image.
[[Sora (text-to-video model)|Sora]] (2024-02) is a diffusion Transformer model (DiT).
=== Stability AI ===
{{Main|Stable Diffusion}}
[[Stable Diffusion]] (2022-08), released by Stability AI, consists of a denoising latent diffusion model (860 million parameters), a VAE, and a text encoder. The denoising network is a U-Net, with cross-attention blocks to allow for conditional image generation.<ref name=":02">{{Cite web |last=Alammar |first=Jay |title=The Illustrated Stable Diffusion |url=https://jalammar.github.io/illustrated-stable-diffusion/ |access-date=2022-10-31 |website=jalammar.github.io}}</ref><ref name=":2" />
Stable Diffusion 3 (2024-03)<ref name=":6">{{cite arXiv |last1=Esser |first1=Patrick |title=Scaling Rectified Flow Transformers for High-Resolution Image Synthesis |date=2024-03-05 |eprint=2403.03206 |last2=Kulal |first2=Sumith |last3=Blattmann |first3=Andreas |last4=Entezari |first4=Rahim |last5=Müller |first5=Jonas |last6=Saini |first6=Harry |last7=Levi |first7=Yam |last8=Lorenz |first8=Dominik |last9=Sauer |first9=Axel|class=cs.CV }}</ref> changed the latent diffusion model from the UNet to a Transformer model, and so it is a DiT. It uses rectified flow.
Stable Video 4D (2024-07)<ref>{{cite arXiv |last1=Xie |first1=Yiming |title=SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency |date=2024-07-24 |eprint=2407.17470 |last2=Yao |first2=Chun-Han |last3=Voleti |first3=Vikram |last4=Jiang |first4=Huaizu |last5=Jampani |first5=Varun|class=cs.CV }}</ref> is a latent diffusion model for videos of 3D objects.
=== Google ===
Imagen (2022)<ref>{{Cite web |title=Imagen: Text-to-Image Diffusion Models |url=https://imagen.research.google/ |access-date=2024-04-04 |website=imagen.research.google}}</ref><ref>{{Cite journal |last1=Saharia |first1=Chitwan |last2=Chan |first2=William |last3=Saxena |first3=Saurabh |last4=Li |first4=Lala |last5=Whang |first5=Jay |last6=Denton |first6=Emily L. |last7=Ghasemipour |first7=Kamyar |last8=Gontijo Lopes |first8=Raphael |last9=Karagol Ayan |first9=Burcu |last10=Salimans |first10=Tim |last11=Ho |first11=Jonathan |last12=Fleet |first12=David J. |last13=Norouzi |first13=Mohammad |date=2022-12-06 |title=Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/ec795aeadae0b7d230fa35cbaf04c041-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=36479–36494|arxiv=2205.11487 }}</ref> uses a [[T5 (language model)|T5-XXL language model]] to encode the input text into an embedding vector. It is a cascaded diffusion model with three sub-models. The first step denoises a white noise to a 64×64 image, conditional on the embedding vector of the text. This model has 2B parameters. The second step upscales the image by 64×64→256×256, conditional on embedding. This model has 650M parameters. The third step is similar, upscaling by 256×256→1024×1024. This model has 400M parameters. The three denoising networks are all U-Nets.
Muse (2023-01)<ref>{{cite arXiv |last1=Chang |first1=Huiwen |title=Muse: Text-To-Image Generation via Masked Generative Transformers |date=2023-01-02 |eprint=2301.00704 |last2=Zhang |first2=Han |last3=Barber |first3=Jarred |last4=Maschinot |first4=A. J. |last5=Lezama |first5=Jose |last6=Jiang |first6=Lu |last7=Yang |first7=Ming-Hsuan |last8=Murphy |first8=Kevin |last9=Freeman |first9=William T.|class=cs.CV }}</ref> is not a diffusion model, but an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens.
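Masked-token generation of this kind is typically decoded in parallel over several steps rather than one token at a time. The sketch below is hypothetical (a MaskGIT-style schedule, with a random stub in place of the Transformer) and only illustrates the idea of committing the most confident predictions at each step.

```python
import numpy as np

MASK = -1  # sentinel for a masked image token

def predict(tokens, rng):
    # Stub for the encoder-only Transformer: returns a predicted token and
    # a confidence score per position (random here, for illustration only).
    n = len(tokens)
    return rng.integers(0, 1024, size=n), rng.random(size=n)

def parallel_decode(n_tokens, n_steps, rng):
    # Start fully masked; at each step, fill in the most confident
    # predictions among the still-masked positions.
    tokens = np.full(n_tokens, MASK)
    per_step = -(-n_tokens // n_steps)  # ceiling division, so all slots fill
    for _ in range(n_steps):
        preds, conf = predict(tokens, rng)
        conf = np.where(tokens == MASK, conf, -np.inf)  # ignore filled slots
        for i in np.argsort(-conf)[:per_step]:
            if tokens[i] == MASK:
                tokens[i] = preds[i]
    return tokens

rng = np.random.default_rng(0)
out = parallel_decode(16, 4, rng)
print((out == MASK).sum())  # 0: every token has been filled in
```

Because many tokens are committed per step, generation takes a fixed small number of forward passes instead of one pass per token.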
Imagen 2 (2023-12) is also diffusion-based. It can generate images based on a prompt that mixes images and text, but few technical details have been published.<ref>{{Cite web |title=Imagen 2 - our most advanced text-to-image technology |url=https://deepmind.google/technologies/imagen-2/ |access-date=2024-04-04 |website=Google DeepMind |language=en}}</ref> Imagen 3 (2024-05) is likewise diffusion-based, again with few details released.<ref>{{Citation |last1=Imagen-Team-Google |title=Imagen 3 |date=2024-12-13 |arxiv=2408.07009 |last2=Baldridge |first2=Jason |last3=Bauer |first3=Jakob |last4=Bhutani |first4=Mukul |last5=Brichtova |first5=Nicole |last6=Bunner |first6=Andrew |last7=Castrejon |first7=Lluis |last8=Chan |first8=Kelvin |last9=Chen |first9=Yichang}}</ref>
Veo (2024) generates videos by latent diffusion. The diffusion is conditioned on a vector that encodes both a text prompt and an image prompt.<ref>{{Cite web |date=2024-05-14 |title=Veo |url=https://deepmind.google/technologies/veo/ |access-date=2024-05-17 |website=Google DeepMind |language=en}}</ref>
=== Meta ===
Make-A-Video (2022) is a text-to-video diffusion model.<ref>{{Cite web |url=https://ai.meta.com/blog/generative-ai-text-to-video/ |access-date=2024-09-20|title=Introducing Make-A-Video: An AI system that generates videos from text |website=ai.meta.com}}</ref><ref>{{cite arXiv |last1=Singer |first1=Uriel |title=Make-A-Video: Text-to-Video Generation without Text-Video Data |date=2022-09-29 |eprint=2209.14792 |last2=Polyak |first2=Adam |last3=Hayes |first3=Thomas |last4=Yin |first4=Xi |last5=An |first5=Jie |last6=Zhang |first6=Songyang |last7=Hu |first7=Qiyuan |last8=Yang |first8=Harry |last9=Ashual |first9=Oron|class=cs.CV }}</ref>
CM3leon (2023) is not a diffusion model, but an autoregressive causally masked Transformer, with mostly the same architecture as [[Llama (language model)|LLaMa]]-2.<ref>{{Cite web |url=https://ai.meta.com/blog/generative-ai-text-images-cm3leon/ |title=Introducing CM3leon, a more efficient, state-of-the-art generative model for text and images|access-date=2024-09-20 |website=ai.meta.com}}</ref><ref>{{cite arXiv |last=Chameleon Team |title=Chameleon: Mixed-Modal Early-Fusion Foundation Models |date=2024-05-16 |class=cs.CL |eprint=2405.09818}}</ref>
[[File:Transfusion_(2024)_Fig_1,_architectural_diagram.png|thumb|279x279px|Transfusion architectural diagram]]
Transfusion (2024) is a Transformer that combines autoregressive text generation and denoising diffusion. Specifically, it generates text autoregressively (with causal masking), and generates images by denoising multiple times over image tokens (with all-to-all attention).<ref>{{Cite arXiv |last1=Zhou |first1=Chunting |last2=Yu |first2=Lili |last3=Babu |first3=Arun |last4=Tirumala |first4=Kushal |last5=Yasunaga |first5=Michihiro |last6=Shamis |first6=Leonid |last7=Kahn |first7=Jacob |last8=Ma |first8=Xuezhe |last9=Zettlemoyer |first9=Luke |date=2024-08-20 |title=Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model |class=cs.AI |eprint=2408.11039 |language=en}}</ref>
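The mixed attention pattern described above can be made concrete as a single attention mask. The function below is an illustrative sketch (not Meta's code) for one text span followed by one image span: text tokens attend causally, while the image-token block attends all-to-all.

```python
import numpy as np

def transfusion_mask(n_text, n_image):
    # Boolean attention mask: entry [i, j] is True if token i may attend
    # to token j. Causal everywhere, except bidirectional among image tokens.
    n = n_text + n_image
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[n_text:, n_text:] = True                # all-to-all among image tokens
    return mask

m = transfusion_mask(n_text=3, n_image=2)
print(m[0, 1], m[3, 4])  # False True: text is causal, image is all-to-all
```

Under this mask, image tokens still see all preceding text (for conditioning), while text generation remains strictly left-to-right.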
Movie Gen (2024) is a series of Diffusion Transformers operating on latent space and by flow matching.<ref>''[https://ai.meta.com/static-resource/movie-gen-research-paper Movie Gen: A Cast of Media Foundation Models]'', The Movie Gen team @ Meta, October 4, 2024.</ref>
== See also ==
== Further reading ==
* Review papers
** {{Citation |last=Yang |first=Ling |title=YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy |date=2024-09-06 |url=https://github.com/YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy |access-date=2024-09-06}}
** {{Cite journal |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2023-11-09 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications |url=https://dl.acm.org/doi/abs/10.1145/3626235 |journal=ACM Comput. Surv. |volume=56 |issue=4 |pages=105:1–105:39 |doi=10.1145/3626235 |issn=0360-0300|arxiv=2209.00796 }}
** {{ Cite arXiv | eprint=2107.03006 | last1=Austin | first1=Jacob | last2=Johnson | first2=Daniel D. | last3=Ho | first3=Jonathan | last4=Tarlow | first4=Daniel | author5=Rianne van den Berg | title=Structured Denoising Diffusion Models in Discrete State-Spaces | date=2021 | class=cs.LG }}
** {{Cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023-09-01 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |issn=0162-8828|arxiv=2209.04747 |bibcode=2023ITPAM..4510850C }}
* Mathematical details omitted from the article
** {{Cite web |date=2022-09-25 |title=Power of Diffusion Models |url=https://astralord.github.io/posts/power-of-diffusion-models/ |access-date=2023-09-25 |website=AstraBlog |language=en}}
** {{Cite arXiv |last=Luo |first=Calvin |date=2022-08-25 |title=Understanding Diffusion Models: A Unified Perspective |class=cs.LG |eprint=2208.11970}}
** {{Cite web |last=Weng |first=Lilian |date=2021-07-11 |title=What are Diffusion Models? |url=https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ |access-date=2023-09-25 |website=lilianweng.github.io |language=en}}
* Tutorials
** {{Cite arXiv |eprint=2406.08929 |first1=Preetum |last1=Nakkiran |first2=Arwen |last2=Bradley |title=Step-by-Step Diffusion: An Elementary Tutorial |date=2024 |last3=Zhou |first3=Hattie |last4=Advani |first4=Madhu|class=cs.LG }}
** {{Cite web |url=https://benanne.github.io/2022/05/26/guidance.html |title=Guidance: a cheat code for diffusion models|date=26 May 2022 }} Overview of classifier guidance and classifier-free guidance, light on mathematical details.
== References ==
[[Category:Markov models]]
[[Category:Machine learning algorithms]]
__FORCETOC__