Diffusion model

In [[machine learning]], '''diffusion models''', also known as '''diffusion-based generative models''' or '''score-based generative models''', are a class of [[latent variable model|latent variable]] [[generative model|generative]] models. A diffusion model consists of two major components: a forward diffusion process and a reverse sampling process. The goal of a diffusion model is to learn a [[diffusion process]] for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. It models data as being generated by a diffusion process, whereby a new datum performs a [[Wiener process|random walk with drift]] through the space of all possible data.<ref name="song"/> A trained diffusion model can be sampled in many ways, with varying efficiency and quality.
 
There are various equivalent formalisms, including [[Markov chain]]s, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.<ref>{{cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023 |title=Diffusion Models in Vision: A Survey |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |arxiv=2209.04747 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |bibcode=2023ITPAM..4510850C |s2cid=252199918}}</ref> They are typically trained using [[Variational Bayesian methods|variational inference]].<ref name="ho" /> The model responsible for denoising is typically called its "[[#Choice of architecture|backbone]]". The backbone may be of any kind, but it is typically a [[U-Net]] or a [[Transformer (deep learning architecture)|transformer]].
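For example, in the stochastic differential equation formulation, the forward process gradually perturbs the data <math>x</math> by a diffusion of the form
<math display="block">\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,</math>
where <math>f</math> and <math>g</math> are the drift and diffusion coefficients and <math>W_t</math> is a Wiener process, and new samples can be generated by integrating the corresponding reverse-time equation
<math display="block">\mathrm{d}x = \left[f(x, t) - g(t)^2 \nabla_x \ln p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{W}_t,</math>
in which the score function <math>\nabla_x \ln p_t(x)</math> is approximated by the trained backbone.<ref name="song"/>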
 
{{As of|2024}}, diffusion models are mainly used for [[computer vision]] tasks, including [[image denoising]], [[inpainting]], [[super-resolution]], [[text-to-image model|image generation]], and video generation. These typically involve training a neural network to sequentially [[denoise]] images blurred with [[Gaussian noise]].<ref name="song">{{Cite arXiv |last1=Song |first1=Yang |last2=Sohl-Dickstein |first2=Jascha |last3=Kingma |first3=Diederik P. |last4=Kumar |first4=Abhishek |last5=Ermon |first5=Stefano |last6=Poole |first6=Ben |date=2021-02-10 |title=Score-Based Generative Modeling through Stochastic Differential Equations |class=cs.LG |eprint=2011.13456 }}</ref><ref name="gu">{{cite arXiv |last1=Gu |first1=Shuyang |last2=Chen |first2=Dong |last3=Bao |first3=Jianmin |last4=Wen |first4=Fang |last5=Zhang |first5=Bo |last6=Chen |first6=Dongdong |last7=Yuan |first7=Lu |last8=Guo |first8=Baining |title=Vector Quantized Diffusion Model for Text-to-Image Synthesis |date=2021 |class=cs.CV |eprint=2111.14822}}</ref> The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.
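As an illustration, the generation loop can be sketched in a few lines of code. The following is a minimal, non-optimized sketch of DDPM-style ancestral sampling; the noise-prediction network <code>eps_model</code> is a hypothetical placeholder (here a dummy function), whereas a real system would use a trained U-net or transformer backbone and a tuned noise schedule.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical stand-in for a trained noise-prediction network epsilon_theta(x_t, t).
# In practice this would be a U-net or transformer backbone.
def eps_model(x, t):
    return np.zeros_like(x)  # dummy placeholder for illustration only

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # fixed noise schedule beta_1, ..., beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative products \bar{alpha}_t

def sample(shape):
    x = np.random.randn(*shape)      # start from pure Gaussian noise x_T
    for t in reversed(range(T)):     # iteratively denoise, t = T-1, ..., 0
        eps = eps_model(x, t)        # predicted noise at step t
        # DDPM posterior mean for x_{t-1} given x_t and the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*shape)  # sampling noise
    return x

image = sample((3, 32, 32))          # e.g. a 3-channel 32x32 "image"
</syntaxhighlight>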
Diffusion-based image generators, such as [[Stable Diffusion]] and [[DALL-E]], have seen widespread commercial interest. These typically combine a diffusion model with other components, such as text encoders and cross-attention modules, to allow text-conditioned generation.<ref name="dalle2" />
 
Other than computer vision, diffusion models have also found applications in [[natural language processing]]<ref>{{ Cite arXiv |eprint=2410.18514 |last1=Nie |first1=Shen |last2=Zhu |first2=Fengqi |last3=Du |first3=Chao |last4=Pang |first4=Tianyu |last5=Liu |first5=Qian |last6=Zeng |first6=Guangtao |last7=Lin |first7=Min |last8=Li |first8=Chongxuan |title=Scaling up Masked Diffusion Models on Text |date=2024 |class=cs.AI }}</ref><ref>{{ Cite book |last1=Li |first1=Yifan |last2=Zhou |first2=Kun |last3=Zhao |first3=Wayne Xin |last4=Wen |first4=Ji-Rong |chapter=Diffusion Models for Non-autoregressive Text Generation: A Survey |date=August 2023 |pages=6692–6701 |title=Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence |chapter-url=http://dx.doi.org/10.24963/ijcai.2023/750 |___location=California |publisher=International Joint Conferences on Artificial Intelligence Organization |doi=10.24963/ijcai.2023/750|arxiv=2303.06574 |isbn=978-1-956792-03-4 }}</ref> such as [[Natural language generation|text generation]]<ref>{{Cite journal |last1=Han |first1=Xiaochuang |last2=Kumar |first2=Sachin |last3=Tsvetkov |first3=Yulia |date=2023 |title=SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control |url=http://dx.doi.org/10.18653/v1/2023.acl-long.647 |journal=Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=11575–11596 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.acl-long.647|arxiv=2210.17432 }}</ref><ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |url=http://dx.doi.org/10.18653/v1/2023.findings-emnlp.606 |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 |pages=9040–9057 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref> and [[Automatic summarization|summarization]],<ref>{{Cite journal |last1=Zhang |first1=Haopeng |last2=Liu |first2=Xiao |last3=Zhang |first3=Jiawei |date=2023 |title=DiffuSum: Generation Enhanced Extractive Summarization with Diffusion |url=http://dx.doi.org/10.18653/v1/2023.findings-acl.828 |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=13089–13100 |___location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.828|arxiv=2305.01735 }}</ref> sound generation,<ref>{{Cite journal |last1=Yang |first1=Dongchao |last2=Yu |first2=Jianwei |last3=Wang |first3=Helin |last4=Wang |first4=Wen |last5=Weng |first5=Chao |last6=Zou |first6=Yuexian |last7=Yu |first7=Dong |date=2023 |title=Diffsound: Discrete Diffusion Model for Text-to-Sound Generation |url=http://dx.doi.org/10.1109/taslp.2023.3268730 |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |volume=31 |pages=1720–1733 |doi=10.1109/taslp.2023.3268730 |issn=2329-9290|arxiv=2207.09983 |bibcode=2023ITASL..31.1720Y }}</ref> and reinforcement learning.<ref>{{cite arXiv |last1=Janner |first1=Michael |title=Planning with Diffusion for Flexible Behavior Synthesis |date=2022-12-20 |eprint=2205.09991 |last2=Du |first2=Yilun |last3=Tenenbaum |first3=Joshua B. 
|last4=Levine |first4=Sergey|class=cs.LG }}</ref><ref>{{cite arXiv |last1=Chi |first1=Cheng |title=Diffusion Policy: Visuomotor Policy Learning via Action Diffusion |date=2024-03-14 |eprint=2303.04137 |last2=Xu |first2=Zhenjia |last3=Feng |first3=Siyuan |last4=Cousineau |first4=Eric |last5=Du |first5=Yilun |last6=Burchfiel |first6=Benjamin |last7=Tedrake |first7=Russ |last8=Song |first8=Shuran|class=cs.RO }}</ref>
 
== Denoising diffusion model ==
The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by [[Variational Bayesian methods|variational inference]].<ref name="ho">{{Cite journal |last1=Ho |first1=Jonathan |last2=Jain |first2=Ajay |last3=Abbeel |first3=Pieter |date=2020 |title=Denoising Diffusion Probabilistic Models |url=https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=6840–6851}}</ref><ref>{{Citation |last=Ho |first=Jonathan |title=hojonathanho/diffusion |date=Jun 20, 2020 |url=https://github.com/hojonathanho/diffusion |access-date=2024-09-07}}</ref>
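For illustration, the simplified training objective of DDPM, a mean-squared error between the true and predicted noise at a randomly chosen diffusion step, can be sketched as follows, reusing a hypothetical placeholder network <code>eps_model</code> as a stand-in for the trainable backbone (the notation for the noise schedule is defined in the next subsection):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical stand-in for the trainable noise-prediction network epsilon_theta(x_t, t).
def eps_model(x, t):
    return np.zeros_like(x)  # dummy placeholder for illustration only

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # fixed noise schedule beta_1, ..., beta_T
alpha_bars = np.cumprod(1.0 - betas)     # \bar{alpha}_t = prod_{s <= t} (1 - beta_s)

def simple_loss(x0):
    t = np.random.randint(T)             # sample a diffusion step uniformly
    eps = np.random.randn(*x0.shape)     # Gaussian noise to be added
    # closed-form forward sample: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    # "L_simple": mean-squared error between true and predicted noise
    return np.mean((eps_model(x_t, t) - eps) ** 2)

loss = simple_loss(np.random.randn(3, 32, 32))
</syntaxhighlight>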
 
==== Forward diffusion ====
To present the model, we need some notation.
 
* <math>\beta_1, ..., \beta_T \in (0, 1)</math> are fixed constants.
 
=== Other examples ===
Notable variants include<ref>{{Cite journal |last1=Cao |first1=Hanqun |last2=Tan |first2=Cheng |last3=Gao |first3=Zhangyang |last4=Xu |first4=Yilun |last5=Chen |first5=Guangyong |last6=Heng |first6=Pheng-Ann |last7=Li |first7=Stan Z. |date=July 2024 |title=A Survey on Generative Diffusion Models |url=https://ieeexplore.ieee.org/document/10419041 |journal=IEEE Transactions on Knowledge and Data Engineering |volume=36 |issue=7 |pages=2814–2830 |doi=10.1109/TKDE.2024.3361474 |bibcode=2024ITKDE..36.2814C |issn=1041-4347|url-access=subscription }}</ref> the Poisson flow generative model,<ref>{{Cite journal |last1=Xu |first1=Yilun |last2=Liu |first2=Ziming |last3=Tian |first3=Yonglong |last4=Tong |first4=Shangyuan |last5=Tegmark |first5=Max |last6=Jaakkola |first6=Tommi |date=2023-07-03 |title=PFGM++: Unlocking the Potential of Physics-Inspired Generative Models |url=https://proceedings.mlr.press/v202/xu23m.html |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=38566–38591|arxiv=2302.04265 }}</ref> the consistency model,<ref>{{Cite journal |last1=Song |first1=Yang |last2=Dhariwal |first2=Prafulla |last3=Chen |first3=Mark |last4=Sutskever |first4=Ilya |date=2023-07-03 |title=Consistency Models |url=https://proceedings.mlr.press/v202/song23a |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |pages=32211–32252}}</ref> critically damped Langevin diffusion,<ref>{{Cite arXiv |last1=Dockhorn |first1=Tim |last2=Vahdat |first2=Arash |last3=Kreis |first3=Karsten |date=2021-10-06 |title=Score-Based Generative Modeling with Critically-Damped Langevin Diffusion |class=stat.ML |eprint=2112.07068 }}</ref> GenPhys,<ref>{{cite arXiv |last1=Liu |first1=Ziming |title=GenPhys: From Physical Processes to Generative Models |date=2023-04-05 |eprint=2304.02637 |last2=Luo |first2=Di |last3=Xu |first3=Yilun |last4=Jaakkola |first4=Tommi |last5=Tegmark |first5=Max|class=cs.LG }}</ref> cold diffusion,<ref>{{Cite journal |last1=Bansal |first1=Arpit |last2=Borgnia |first2=Eitan |last3=Chu |first3=Hong-Min |last4=Li |first4=Jie |last5=Kazemi |first5=Hamid |last6=Huang |first6=Furong |last7=Goldblum |first7=Micah |last8=Geiping |first8=Jonas |last9=Goldstein |first9=Tom |date=2023-12-15 |title=Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/80fe51a7d8d0c73ff7439c2a2554ed53-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=41259–41282|arxiv=2208.09392 }}</ref> and discrete diffusion.<ref>{{Cite journal |last1=Gulrajani |first1=Ishaan |last2=Hashimoto |first2=Tatsunori B. |date=2023-12-15 |title=Likelihood-Based Diffusion Language Models |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/35b5c175e139bff5f22a5361270fce87-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=16693–16715|arxiv=2305.18619 }}</ref><ref>{{cite arXiv |last1=Lou |first1=Aaron |title=Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution |date=2024-06-06 |eprint=2310.16834 |last2=Meng |first2=Chenlin |last3=Ermon |first3=Stefano|class=stat.ML }}</ref>
 
== Flow-based diffusion model ==
** {{Cite journal |last1=Yang |first1=Ling |last2=Zhang |first2=Zhilong |last3=Song |first3=Yang |last4=Hong |first4=Shenda |last5=Xu |first5=Runsheng |last6=Zhao |first6=Yue |last7=Zhang |first7=Wentao |last8=Cui |first8=Bin |last9=Yang |first9=Ming-Hsuan |date=2023-11-09 |title=Diffusion Models: A Comprehensive Survey of Methods and Applications |url=https://dl.acm.org/doi/abs/10.1145/3626235 |journal=ACM Comput. Surv. |volume=56 |issue=4 |pages=105:1–105:39 |doi=10.1145/3626235 |issn=0360-0300|arxiv=2209.00796 }}
** {{ Cite arXiv | eprint=2107.03006 | last1=Austin | first1=Jacob | last2=Johnson | first2=Daniel D. | last3=Ho | first3=Jonathan | last4=Tarlow | first4=Daniel | author5=Rianne van den Berg | title=Structured Denoising Diffusion Models in Discrete State-Spaces | date=2021 | class=cs.LG }}
** {{Cite journal |last1=Croitoru |first1=Florinel-Alin |last2=Hondru |first2=Vlad |last3=Ionescu |first3=Radu Tudor |last4=Shah |first4=Mubarak |date=2023-09-01 |title=Diffusion Models in Vision: A Survey |url=https://ieeexplore.ieee.org/document/10081412 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=9 |pages=10850–10869 |doi=10.1109/TPAMI.2023.3261988 |pmid=37030794 |issn=0162-8828|arxiv=2209.04747 |bibcode=2023ITPAM..4510850C }}
* Mathematical details omitted in the article.
** {{Cite web |date=2022-09-25 |title=Power of Diffusion Models |url=https://astralord.github.io/posts/power-of-diffusion-models/ |access-date=2023-09-25 |website=AstraBlog |language=en}}