{{Short description|Deep learning algorithm}}{{About|the technique in generative statistical modeling|3=Diffusion (disambiguation)}}
{{Machine learning|Artificial neural network}} In [[machine learning]], '''diffusion models''', also known as '''diffusion-based generative models''' or '''score-based generative models''', are a class of [[latent variable model|latent variable]] [[generative model]]s that learn to generate data by reversing a gradual noising process.
There are various equivalent formalisms, including [[Markov chain]]s, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.<ref>{{cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |bibcode=2023ITPAM..4510850C |s2cid=252199918}}</ref> They are typically trained using [[Variational Bayesian methods|variational inference]].<ref name="ho" /> The model responsible for denoising is typically called its "[[#Choice of architecture|backbone]]". The backbone may be of any kind, but they are typically [[U-Net|U-nets]] or [[Transformer (deep learning architecture)|transformers]].
{{As of|2024}}, diffusion models are mainly used for [[computer vision]] tasks, including [[image denoising]], [[inpainting]], [[super-resolution]], [[text-to-image model|image generation]], and video generation. These typically involve training a neural network to sequentially [[denoise]] images blurred with [[Gaussian noise]].<ref name="song">{{Cite arXiv |last1=Song |first1=Yang |last2=Sohl-Dickstein |first2=Jascha |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben |date=2021-02-10 |title=Score-Based Generative Modeling through Stochastic Differential Equations |class=cs.LG |eprint=2011.13456 }}</ref><ref name="gu">{{cite arXiv |last1=Gu |first1=Shuyang |last2=Chen |first2=Dong |last3=Bao |first3=Jianmin |last4=Wen |first4=Fang |last5=Zhang |first5=Bo |last6=Chen |first6=Dongdong |last7=Yuan |first7=Lu |last8=Guo |first8=Baining |title=Vector Quantized Diffusion Model for Text-to-Image Synthesis |date=2021 |class=cs.CV |eprint=2111.14822}}</ref> The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.
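The iterative denoising loop can be sketched as follows (a minimal illustration of DDPM-style ancestral sampling, not taken from the cited sources; `model(x, t)` is a stand-in for a trained backbone, and all names and schedule values are illustrative):

```python
import numpy as np

def ddpm_sample(model, betas, shape, rng=np.random.default_rng(0)):
    """Generate a sample by iteratively denoising pure Gaussian noise.

    `model(x, t)` is assumed to predict the noise that was added at
    step t (in practice, a trained U-net or transformer backbone).
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # start from pure noise x_T
    for t in reversed(range(len(betas))):
        eps = model(x, t)                   # predicted noise at step t
        # DDPM posterior mean of x_{t-1} given x_t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # inject fresh noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Each iteration removes a little of the predicted noise and, except at the final step, re-injects a smaller amount of fresh noise, mirroring the reverse of the forward noising process.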
Diffusion-based image generators, such as [[Stable Diffusion]] and [[DALL-E]], have seen widespread commercial interest. These systems typically combine a diffusion model with other components, such as a text encoder and cross-attention modules, to allow text-conditioned generation.<ref name="dalle2" />
Other than computer vision, diffusion models have also found applications in [[natural language processing]]<ref>{{ Cite arXiv |eprint=2410.18514 |last1=Nie |first1=Shen |last2=Zhu |first2=Fengqi |last3=Du |first3=Chao |last4=Pang |first4=Tianyu |last5=Liu |first5=Qian |last6=Zeng |first6=Guangtao |last7=Lin |first7=Min |last8=Li |first8=Chongxuan |title=Scaling up Masked Diffusion Models on Text |date=2024 |class=cs.AI }}</ref><ref>{{ Cite book |last1=Li |first1=Yifan |last2=Zhou |first2=Kun |last3=Zhao |first3=Wayne Xin |last4=Wen |first4=Ji-Rong |chapter=Diffusion Models for Non-autoregressive Text Generation: A Survey |date=August 2023 |pages=6692–6701 |title=Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence |chapter-url=http://dx.doi.org/10.24963/ijcai.2023/750 |___location=California |publisher=International Joint Conferences on Artificial Intelligence Organization |doi=10.24963/ijcai.2023/750|arxiv=2303.06574 |isbn=978-1-956792-03-4 }}</ref> such as [[Natural language generation|text generation]]<ref>{{Cite journal |last1=Han |first1=Xiaochuang |last2=Kumar |first2=Sachin |last3=Tsvetkov |first3=Yulia |date=2023 |title=SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control |url=http://dx.doi.org/10.18653/v1/2023.acl-long.647 |journal=Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=11575–11596 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.acl-long.647|arxiv=2210.17432 }}</ref><ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |url=http://dx.doi.org/10.18653/v1/2023.findings-emnlp.606 |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 
|pages=9040–9057 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref> and [[Automatic summarization|summarization]],<ref>{{Cite journal |last1=Zhang |first1=Haopeng |last2=Liu |first2=Xiao |last3=Zhang |first3=Jiawei |date=2023 |title=DiffuSum: Generation Enhanced Extractive Summarization with Diffusion |url=http://dx.doi.org/10.18653/v1/2023.findings-acl.828 |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=13089–13100 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.828|arxiv=2305.01735 }}</ref> sound generation,<ref>{{Cite journal |last1=Yang |first1=Dongchao |last2=Yu |first2=Jianwei |last3=Wang |first3=Helin |last4=Wang |first4=Wen |last5=Weng |first5=Chao |last6=Zou |first6=Yuexian |last7=Yu |first7=Dong |date=2023 |title=Diffsound: Discrete Diffusion Model for Text-to-Sound Generation |url=http://dx.doi.org/10.1109/taslp.2023.3268730 |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |volume=31 |pages=1720–1733 |doi=10.1109/taslp.2023.3268730 |issn=2329-9290|arxiv=2207.09983 |bibcode=2023ITASL..31.1720Y }}</ref> and reinforcement learning.<ref>{{cite arXiv |last1=Janner |first1=Michael |title=Planning with Diffusion for Flexible Behavior Synthesis |date=2022-12-20 |eprint=2205.09991 |last2=Du |first2=Yilun |last3=Tenenbaum |first3=Joshua B. |last4=Levine |first4=Sergey|class=cs.LG }}</ref><ref>{{cite arXiv |last1=Chi |first1=Cheng |title=Diffusion Policy: Visuomotor Policy Learning via Action Diffusion |date=2024-03-14 |eprint=2303.04137 |last2=Xu |first2=Zhenjia |last3=Feng |first3=Siyuan |last4=Cousineau |first4=Eric |last5=Du |first5=Yilun |last6=Burchfiel |first6=Benjamin |last7=Tedrake |first7=Russ |last8=Song |first8=Shuran|class=cs.RO }}</ref>
== Denoising diffusion model ==
=== Non-equilibrium thermodynamics ===
Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution, using techniques from [[non-equilibrium thermodynamics]], especially diffusion.
Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in that space which, by repeatedly adding noise to the images, diffuses out until it becomes all but indistinguishable from a Gaussian. A model that can approximately undo this diffusion can then be used to sample from the original distribution.
The equilibrium distribution is the Gaussian distribution <math>\mathcal{N}(0, I)</math>, with pdf <math>\rho(x) \propto e^{-\frac 12 \|x\|^2}</math>. This is just the [[Maxwell–Boltzmann distribution]] of particles in a potential well <math>V(x) = \frac 12 \|x\|^2</math> at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a [[Brownian motion|Brownian walker]]) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.
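The interplay of gradient descent and randomness can be illustrated with a small overdamped [[Langevin dynamics]] simulation (an illustrative sketch, not from the cited sources; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def langevin_step(x, score, step, rng):
    """One overdamped Langevin step: a small gradient ascent on the
    log-density plus Gaussian noise (a Brownian-walker increment)."""
    return x + 0.5 * step * score(x) + np.sqrt(step) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
score = lambda x: -x                       # score of N(0, I): grad log rho(x) = -x
x = 5.0 * rng.standard_normal((1000, 2))   # a cloud of particles far from equilibrium
for _ in range(2000):
    x = langevin_step(x, score, 0.01, rng)
# the cloud relaxes toward N(0, I); dropping the noise term instead
# turns the update into pure gradient descent on V(x) = ||x||^2 / 2,
# which collapses every particle onto the origin
```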
==== Learning the score function ====
Given a density <math>q</math>, we wish to learn a score function approximation <math>f_\theta \approx \nabla \ln q</math>. This is '''score matching'''.<ref>{{Cite web |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation {{!}} Yang Song |url=https://yang-song.net/blog/2019/ssm/ |access-date=2023-09-24 |website=yang-song.net}}</ref> Typically, score matching is formalized as minimizing the '''Fisher divergence''' <math>E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2]</math>. By expanding the integral and performing an integration by parts, <math display="block">E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2] = E_q[\|f_\theta\|^2 + 2\nabla \cdot f_\theta] + C</math> where <math>C</math> is a constant independent of <math>\theta</math>, so the objective can be minimized without knowing <math>\nabla \ln q</math> itself.
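Score matching via the integration-by-parts form <math>E_q[\|f_\theta\|^2 + 2\nabla \cdot f_\theta]</math> can be checked numerically in one dimension (an illustrative sketch, not from the cited sources; the linear score model <math>f_\theta(x) = \theta x</math> is an assumption chosen so the minimizer is known in closed form):

```python
import numpy as np

# Implicit score matching in one dimension: for the linear model
# f_theta(x) = theta * x, the objective E_q[f^2 + 2 f'] equals
# theta^2 * E[x^2] + 2 * theta, minimized at theta* = -1 / E[x^2],
# which is exactly the slope of the true score -x / sigma^2
# of a centered Gaussian with standard deviation sigma.
rng = np.random.default_rng(0)
sigma = 2.0
data = rng.normal(0.0, sigma, size=100_000)

def ism_objective(theta, x):
    # Monte-Carlo estimate of E[f_theta(x)^2 + 2 * f_theta'(x)]
    return np.mean((theta * x) ** 2) + 2.0 * theta

thetas = np.linspace(-1.0, 0.0, 1001)
best = thetas[np.argmin([ism_objective(t, data) for t in thetas])]
# best is close to -1 / sigma**2 = -0.25
```

The minimizer is recovered from samples alone, without ever evaluating the true score, which is the practical point of the integration-by-parts identity.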
==== Annealing the score function ====
==== Backward diffusion process ====
If we have solved <math>\rho_t</math> for time <math>t\in [0, T]</math>, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density <math>\nu_0 = \rho_T</math>, and let the particles in the cloud evolve according to
<math display="block">dy_t = \frac{1}{2} \beta(T-t) y_{t} d t + \beta(T-t) \underbrace{\nabla_{y_{t}} \ln \rho_{T-t}\left(y_{t}\right)}_{\text {score function }} d t+\sqrt{\beta(T-t)} d W_t</math> then by plugging into the Fokker-Planck equation, we find that <math>\partial_t \rho_{T-t} = \partial_t \nu_t</math>. Thus this cloud of points is the original cloud, evolving backwards.<ref>{{Cite journal |last=Anderson |first=Brian D.O. |date=May 1982 |title=Reverse-time diffusion equation models |url=http://dx.doi.org/10.1016/0304-4149(82)90051-5 |journal=Stochastic Processes and Their Applications |volume=12 |issue=3 |pages=313–326 |doi=10.1016/0304-4149(82)90051-5 |issn=0304-4149|url-access=subscription }}</ref>

=== Noise conditional score network (NCSN) ===
=== Other examples ===
Notable variants include the Poisson flow generative model, the consistency model, critically-damped Langevin diffusion, and cold diffusion.<ref>{{Cite journal |last1=Cao |first1=Hanqun |last2=Tan |first2=Cheng |last3=Gao |first3=Zhangyang |last4=Xu |first4=Yilun |last5=Chen |first5=Guangyong |last6=Heng |first6=Pheng-Ann |last7=Li |first7=Stan Z. |date=July 2024 |title=A Survey on Generative Diffusion Models |journal=IEEE Transactions on Knowledge and Data Engineering |arxiv=2209.02646}}</ref>
== Flow-based diffusion model ==
<math display="block">\min_{\theta} \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left [\lVert{(x_1-x_0) - v_t(x_t)}\rVert^2\right] \,\mathrm{d}t.</math>
The data pair <math>(x_0, x_1)</math> can be any coupling of <math>\pi_0</math> and <math>\pi_1</math>, typically independent (i.e., <math>(x_0,x_1) \sim \pi_0 \times \pi_1</math>), obtained by randomly pairing observations from <math>\pi_0</math> and <math>\pi_1</math>. This process ensures that the trajectories closely mirror the density map of <math>x_t</math> trajectories but ''reroute'' at intersections to ensure causality.
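A Monte-Carlo estimate of this objective can be sketched as follows (an illustrative sketch, not from the cited sources; `v` stands in for the velocity field being trained):

```python
import numpy as np

def rectified_flow_loss(v, x0, x1, rng):
    """Monte-Carlo estimate of E_t ||(x1 - x0) - v(x_t, t)||^2 with the
    linear interpolation x_t = t * x1 + (1 - t) * x0.

    `v(x_t, t)` stands in for the trained velocity network; x0 and x1
    are batches of paired samples from pi_0 and pi_1.
    """
    t = rng.uniform(size=(len(x0), 1))     # one random time per data pair
    xt = t * x1 + (1.0 - t) * x0
    residual = (x1 - x0) - v(xt, t)
    return float(np.mean(np.sum(residual ** 2, axis=1)))
```

In training, <math>x_0</math> is drawn from the noise distribution and <math>x_1</math> from the data, and this loss is minimized over the parameters of <math>v</math>.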
[[File:Reflow Illustration.png|thumb|390px|The reflow process<ref name=":0"/>]]
Rectified flow includes a nonlinear extension where the linear interpolation <math>x_t</math> is replaced with any time-differentiable curve that connects <math>x_0</math> and <math>x_1</math>, given by <math>x_t = \alpha_t x_1 + \beta_t x_0</math>. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of <math>\alpha_t</math> and <math>\beta_t</math>. However, in the case where the path of <math>x_t</math> is not straight, the reflow process no longer ensures a reduction in convex transport costs, and no longer straightens the paths of <math>\phi_t</math>.<ref name=":0" />
== Choice of architecture ==
** {{Cite journal |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2023-11-09 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications |url=https://dl.acm.org/doi/abs/10.1145/3626235 |journal=ACM Comput. Surv. |volume=56 |issue=4 |pages=105:1–105:39 |doi=10.1145/3626235 |issn=0360-0300|arxiv=2209.00796 }}
** {{ Cite arXiv | eprint=2107.03006 | last1=Austin | first1=Jacob | last2=Johnson | first2=Daniel D. | last3=Ho | first3=Jonathan | last4=Tarlow | first4=Daniel | author5=Rianne van den Berg | title=Structured Denoising Diffusion Models in Discrete State-Spaces | date=2021 | class=cs.LG }}
** {{Cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023-09-01 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |s2cid=252199918}}
* Mathematical details omitted in the article.
** {{Cite web |date=2022-09-25 |title=Power of Diffusion Models |url=https://astralord.github.io/posts/power-of-diffusion-models/ |access-date=2023-09-25 |website=AstraBlog |language=en}}
[[Category:Markov models]]
[[Category:Machine learning algorithms]]
__FORCETOC__