{{Short description|Deep learning algorithm}}{{About|the technique in generative statistical modeling|3=Diffusion (disambiguation)}}
{{Machine learning|Artificial neural network}} In [[machine learning]], '''diffusion models''', also known as '''diffusion-based generative models''' or '''score-based generative models''', are a class of [[latent variable model|latent variable]] [[generative model]]s that learn to generate data by reversing a gradual noising process.
There are various equivalent formalisms, including [[Markov chain]]s, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.<ref>{{cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |bibcode=2023ITPAM..4510850C |s2cid=252199918}}</ref> They are typically trained using [[Variational Bayesian methods|variational inference]].<ref name="ho" /> The model responsible for denoising is typically called its "[[#Choice of architecture|backbone]]". The backbone may be of any kind, but they are typically [[U-Net|U-nets]] or [[Transformer (deep learning architecture)|transformers]].
{{As of|2024}}, diffusion models are mainly used for [[computer vision]] tasks, including [[image denoising]], [[inpainting]], [[super-resolution]], [[text-to-image model|image generation]], and video generation. These typically involve training a neural network to sequentially [[denoise]] images blurred with [[Gaussian noise]].<ref name="song">{{Cite arXiv |last1=Song |first1=Yang |last2=Sohl-Dickstein |first2=Jascha |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben |date=2021-02-10 |title=Score-Based Generative Modeling through Stochastic Differential Equations |class=cs.LG |eprint=2011.13456 }}</ref><ref name="gu">{{cite arXiv |last1=Gu |first1=Shuyang |last2=Chen |first2=Dong |last3=Bao |first3=Jianmin |last4=Wen |first4=Fang |last5=Zhang |first5=Bo |last6=Chen |first6=Dongdong |last7=Yuan |first7=Lu |last8=Guo |first8=Baining |title=Vector Quantized Diffusion Model for Text-to-Image Synthesis |date=2021 |class=cs.CV |eprint=2111.14822}}</ref> The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.
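The iterative denoising loop can be sketched as follows (a minimal illustration of DDPM-style ancestral sampling, not taken from the cited sources; `model(x, t)` is a stand-in for a trained backbone, and all names and schedule values are illustrative):

```python
import numpy as np

def ddpm_sample(model, betas, shape, rng=np.random.default_rng(0)):
    """Generate a sample by iteratively denoising pure Gaussian noise.

    `model(x, t)` is assumed to predict the noise that was added at
    step t (in practice, a trained U-net or transformer backbone).
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # start from pure noise x_T
    for t in reversed(range(len(betas))):
        eps = model(x, t)                   # predicted noise at step t
        # DDPM posterior mean of x_{t-1} given x_t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # inject fresh noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Each iteration removes a little of the predicted noise and, except at the final step, re-injects a smaller amount of fresh noise, mirroring the reverse of the forward noising process.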
Diffusion-based image generators, such as [[Stable Diffusion]] and [[DALL-E]], have seen widespread commercial interest. These systems typically combine a diffusion model with other components, such as a text encoder and cross-attention modules, to allow text-conditioned generation.<ref name="dalle2" />
Other than computer vision, diffusion models have also found applications in [[natural language processing]]<ref>{{ Cite arXiv |eprint=2410.18514 |last1=Nie |first1=Shen |last2=Zhu |first2=Fengqi |last3=Du |first3=Chao |last4=Pang |first4=Tianyu |last5=Liu |first5=Qian |last6=Zeng |first6=Guangtao |last7=Lin |first7=Min |last8=Li |first8=Chongxuan |title=Scaling up Masked Diffusion Models on Text |date=2024 |class=cs.AI }}</ref><ref>{{ Cite book |last1=Li |first1=Yifan |last2=Zhou |first2=Kun |last3=Zhao |first3=Wayne Xin |last4=Wen |first4=Ji-Rong |chapter=Diffusion Models for Non-autoregressive Text Generation: A Survey |date=August 2023 |pages=6692–6701 |title=Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence |chapter-url=http://dx.doi.org/10.24963/ijcai.2023/750 |___location=California |publisher=International Joint Conferences on Artificial Intelligence Organization |doi=10.24963/ijcai.2023/750|arxiv=2303.06574 |isbn=978-1-956792-03-4 }}</ref> such as [[Natural language generation|text generation]]<ref>{{Cite journal |last1=Han |first1=Xiaochuang |last2=Kumar |first2=Sachin |last3=Tsvetkov |first3=Yulia |date=2023 |title=SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control |url=http://dx.doi.org/10.18653/v1/2023.acl-long.647 |journal=Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=11575–11596 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.acl-long.647|arxiv=2210.17432 }}</ref><ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |url=http://dx.doi.org/10.18653/v1/2023.findings-emnlp.606 |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 
|pages=9040–9057 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref> and [[Automatic summarization|summarization]],<ref>{{Cite journal |last1=Zhang |first1=Haopeng |last2=Liu |first2=Xiao |last3=Zhang |first3=Jiawei |date=2023 |title=DiffuSum: Generation Enhanced Extractive Summarization with Diffusion |url=http://dx.doi.org/10.18653/v1/2023.findings-acl.828 |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=13089–13100 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.828|arxiv=2305.01735 }}</ref> sound generation,<ref>{{Cite journal |last1=Yang |first1=Dongchao |last2=Yu |first2=Jianwei |last3=Wang |first3=Helin |last4=Wang |first4=Wen |last5=Weng |first5=Chao |last6=Zou |first6=Yuexian |last7=Yu |first7=Dong |date=2023 |title=Diffsound: Discrete Diffusion Model for Text-to-Sound Generation |url=http://dx.doi.org/10.1109/taslp.2023.3268730 |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |volume=31 |pages=1720–1733 |doi=10.1109/taslp.2023.3268730 |issn=2329-9290|arxiv=2207.09983 |bibcode=2023ITASL..31.1720Y }}</ref> and reinforcement learning.<ref>{{cite arXiv |last1=Janner |first1=Michael |title=Planning with Diffusion for Flexible Behavior Synthesis |date=2022-12-20 |eprint=2205.09991 |last2=Du |first2=Yilun |last3=Tenenbaum |first3=Joshua B. |last4=Levine |first4=Sergey|class=cs.LG }}</ref><ref>{{cite arXiv |last1=Chi |first1=Cheng |title=Diffusion Policy: Visuomotor Policy Learning via Action Diffusion |date=2024-03-14 |eprint=2303.04137 |last2=Xu |first2=Zhenjia |last3=Feng |first3=Siyuan |last4=Cousineau |first4=Eric |last5=Du |first5=Yilun |last6=Burchfiel |first6=Benjamin |last7=Tedrake |first7=Russ |last8=Song |first8=Shuran|class=cs.RO }}</ref>
== Denoising diffusion model ==
=== Non-equilibrium thermodynamics ===
Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution, using techniques from [[non-equilibrium thermodynamics]], especially diffusion.
Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in that space which, by repeatedly adding noise to the images, diffuses out until it becomes all but indistinguishable from a Gaussian. A model that can approximately undo this diffusion can then be used to sample from the original distribution.
The equilibrium distribution is the Gaussian distribution <math>\mathcal{N}(0, I)</math>, with pdf <math>\rho(x) \propto e^{-\frac 12 \|x\|^2}</math>. This is just the [[Maxwell–Boltzmann distribution]] of particles in a potential well <math>V(x) = \frac 12 \|x\|^2</math> at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a [[Brownian motion|Brownian walker]]) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.
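The interplay of gradient descent and randomness can be illustrated with a small overdamped [[Langevin dynamics]] simulation (an illustrative sketch, not from the cited sources; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def langevin_step(x, score, step, rng):
    """One overdamped Langevin step: a small gradient ascent on the
    log-density plus Gaussian noise (a Brownian-walker increment)."""
    return x + 0.5 * step * score(x) + np.sqrt(step) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
score = lambda x: -x                       # score of N(0, I): grad log rho(x) = -x
x = 5.0 * rng.standard_normal((1000, 2))   # a cloud of particles far from equilibrium
for _ in range(2000):
    x = langevin_step(x, score, 0.01, rng)
# the cloud relaxes toward N(0, I); dropping the noise term instead
# turns the update into pure gradient descent on V(x) = ||x||^2 / 2,
# which collapses every particle onto the origin
```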
==== Learning the score function ====
Given a density <math>q</math>, we wish to learn a score function approximation <math>f_\theta \approx \nabla \ln q</math>. This is '''score matching'''.<ref>{{Cite web |title=Sliced Score Matching: A Scalable Approach to Density and Score Estimation {{!}} Yang Song |url=https://yang-song.net/blog/2019/ssm/ |access-date=2023-09-24 |website=yang-song.net}}</ref> Typically, score matching is formalized as minimizing the '''Fisher divergence''' <math>E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2]</math>. By expanding the integral and performing an integration by parts, <math display="block">E_q[\|f_\theta(x) - \nabla \ln q(x)\|^2] = E_q[\|f_\theta\|^2 + 2\nabla \cdot f_\theta] + C</math> where <math>C</math> is a constant independent of <math>\theta</math>, so the objective can be minimized without knowing <math>\nabla \ln q</math> itself.
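Score matching via the integration-by-parts form <math>E_q[\|f_\theta\|^2 + 2\nabla \cdot f_\theta]</math> can be checked numerically in one dimension (an illustrative sketch, not from the cited sources; the linear score model <math>f_\theta(x) = \theta x</math> is an assumption chosen so the minimizer is known in closed form):

```python
import numpy as np

# Implicit score matching in one dimension: for the linear model
# f_theta(x) = theta * x, the objective E_q[f^2 + 2 f'] equals
# theta^2 * E[x^2] + 2 * theta, minimized at theta* = -1 / E[x^2],
# which is exactly the slope of the true score -x / sigma^2
# of a centered Gaussian with standard deviation sigma.
rng = np.random.default_rng(0)
sigma = 2.0
data = rng.normal(0.0, sigma, size=100_000)

def ism_objective(theta, x):
    # Monte-Carlo estimate of E[f_theta(x)^2 + 2 * f_theta'(x)]
    return np.mean((theta * x) ** 2) + 2.0 * theta

thetas = np.linspace(-1.0, 0.0, 1001)
best = thetas[np.argmin([ism_objective(t, data) for t in thetas])]
# best is close to -1 / sigma**2 = -0.25
```

The minimizer is recovered from samples alone, without ever evaluating the true score, which is the practical point of the integration-by-parts identity.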
==== Annealing the score function ====
==== Backward diffusion process ====
If we have solved <math>\rho_t</math> for time <math>t\in [0, T]</math>, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density <math>\nu_0 = \rho_T</math>, and let the particles in the cloud evolve according to
<math display="block">dy_t = \frac{1}{2} \beta(T-t) y_{t} d t + \beta(T-t) \underbrace{\nabla_{y_{t}} \ln \rho_{T-t}\left(y_{t}\right)}_{\text {score function }} d t+\sqrt{\beta(T-t)} d W_t</math> then by plugging into the Fokker-Planck equation, we find that <math>\partial_t \rho_{T-t} = \partial_t \nu_t</math>. Thus this cloud of points is the original cloud, evolving backwards.<ref>{{Cite journal |last=Anderson |first=Brian D.O. |date=May 1982 |title=Reverse-time diffusion equation models |url=http://dx.doi.org/10.1016/0304-4149(82)90051-5 |journal=Stochastic Processes and Their Applications |volume=12 |issue=3 |pages=313–326 |doi=10.1016/0304-4149(82)90051-5 |issn=0304-4149|url-access=subscription }}</ref>

=== Noise conditional score network (NCSN) ===
=== Other examples ===
Notable variants include the Poisson flow generative model, the consistency model, critically-damped Langevin diffusion, and cold diffusion.<ref>{{Cite journal |last1=Cao |first1=Hanqun |last2=Tan |first2=Cheng |last3=Gao |first3=Zhangyang |last4=Xu |first4=Yilun |last5=Chen |first5=Guangyong |last6=Heng |first6=Pheng-Ann |last7=Li |first7=Stan Z. |date=July 2024 |title=A Survey on Generative Diffusion Models |journal=IEEE Transactions on Knowledge and Data Engineering |arxiv=2209.02646}}</ref>
== Flow-based diffusion model ==
<math display="block">\min_{\theta} \int_0^1 \mathbb{E}_{\pi_0, \pi_1, p_t}\left [\lVert{(x_1-x_0) - v_t(x_t)}\rVert^2\right] \,\mathrm{d}t.</math>
The data pair <math>(x_0, x_1)</math> can be any coupling of <math>\pi_0</math> and <math>\pi_1</math>, typically independent (i.e., <math>(x_0,x_1) \sim \pi_0 \times \pi_1</math>), obtained by randomly pairing observations from <math>\pi_0</math> and <math>\pi_1</math>. This process ensures that the trajectories closely mirror the density map of <math>x_t</math> trajectories but ''reroute'' at intersections to ensure causality.
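A Monte-Carlo estimate of this objective can be sketched as follows (an illustrative sketch, not from the cited sources; `v` stands in for the velocity field being trained):

```python
import numpy as np

def rectified_flow_loss(v, x0, x1, rng):
    """Monte-Carlo estimate of E_t ||(x1 - x0) - v(x_t, t)||^2 with the
    linear interpolation x_t = t * x1 + (1 - t) * x0.

    `v(x_t, t)` stands in for the trained velocity network; x0 and x1
    are batches of paired samples from pi_0 and pi_1.
    """
    t = rng.uniform(size=(len(x0), 1))     # one random time per data pair
    xt = t * x1 + (1.0 - t) * x0
    residual = (x1 - x0) - v(xt, t)
    return float(np.mean(np.sum(residual ** 2, axis=1)))
```

In training, <math>x_0</math> is drawn from the noise distribution and <math>x_1</math> from the data, and this loss is minimized over the parameters of <math>v</math>.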
[[File:Reflow Illustration.png|thumb|390px|The reflow process<ref name=":0"/>]]
Rectified flow includes a nonlinear extension where the linear interpolation <math>x_t</math> is replaced with any time-differentiable curve that connects <math>x_0</math> and <math>x_1</math>, given by <math>x_t = \alpha_t x_1 + \beta_t x_0</math>. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of <math>\alpha_t</math> and <math>\beta_t</math>. However, in the case where the path of <math>x_t</math> is not straight, the reflow process no longer ensures a reduction in convex transport costs, and no longer straightens the paths of <math>\phi_t</math>.<ref name=":0" />
== Choice of architecture ==
** {{Cite journal |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2023-11-09 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications |url=https://dl.acm.org/doi/abs/10.1145/3626235 |journal=ACM Comput. Surv. |volume=56 |issue=4 |pages=105:1–105:39 |doi=10.1145/3626235 |issn=0360-0300|arxiv=2209.00796 }}
** {{ Cite arXiv | eprint=2107.03006 | last1=Austin | first1=Jacob | last2=Johnson | first2=Daniel D. | last3=Ho | first3=Jonathan | last4=Tarlow | first4=Daniel | author5=Rianne van den Berg | title=Structured Denoising Diffusion Models in Discrete State-Spaces | date=2021 | class=cs.LG }}
** {{Cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023-09-01 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |s2cid=252199918}}
* Mathematical details omitted in the article.
** {{Cite web |date=2022-09-25 |title=Power of Diffusion Models |url=https://astralord.github.io/posts/power-of-diffusion-models/ |access-date=2023-09-25 |website=AstraBlog |language=en}}
[[Category:Markov models]]
[[Category:Machine learning algorithms]]
__FORCETOC__