=== Rectified flow ===
Rectified flow is a method for learning transport maps between two distributions, and it offers a new perspective on diffusion models and their ODE variants. Unlike SDE-based diffusion models, rectified flow is purely ODE-based, giving a simpler, unified framework for generative modeling and data transfer. Among the infinitely many ODEs/SDEs that transport one distribution to another, rectified flow specifically advocates for ODEs whose solution paths are straight lines. Learning straight flows provides a principled way to obtain ODEs with fast inference, effectively training one-step models with ODEs as an intermediate step. The rectified flow (RF) formulation is employed in Stable Diffusion 3.<ref>{{Cite web |title=Stable Diffusion 3: Research Paper |url=https://stability.ai/news/stable-diffusion-3-research-paper |access-date=2024-03-22 |website=Stability AI |language=en-GB}}</ref>
In standard diffusion modeling, the forward process turns the dataset distribution into white noise by adding a little bit of white noise at a time, and the backward process turns white noise back to the dataset distribution by removing a little bit of white noise at a time.
 
If the forward process is well-behaved, then the backward process can also be well-behaved. Rectified flow is one such well-behaved forward process, and it is used in Stable Diffusion 3.
 
Rescale the time interval to <math>[0, 1]</math> and let the starting point <math>x_0</math> be an image sampled from the natural image distribution. The forward process is parametrized by a neural network velocity field <math>v(x_t, t)</math>, such that integrating <math display="block">dx_t = v(x_t, t) \, dt</math> yields <math>x_1</math>, a white-noise image.
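In practice the ODE has to be simulated numerically. A minimal sketch in Python of forward Euler integration, assuming a hypothetical learned velocity function <code>v(x, t)</code> that maps a NumPy array and a scalar time to an array of the same shape (the names are illustrative, not from any specific implementation):

<syntaxhighlight lang="python">
import numpy as np

def integrate_ode(x0, v, num_steps=100):
    """Forward-Euler simulation of dx_t = v(x_t, t) dt over t in [0, 1].

    x0: starting sample (e.g. an image as a NumPy array).
    v:  learned velocity field, called as v(x, t) -> array of x's shape.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + v(x, t) * dt  # one Euler step along the ODE
    return x  # approximates x_1
</syntaxhighlight>

With a well-trained velocity field, fewer Euler steps suffice; for perfectly straight paths a single step is exact (see below).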
 
Given two distributions <math>\pi_0</math> and <math>\pi_1</math>, probability flow models implicitly learn the transport map by constructing an ODE driven by a drift field on <math>\mathbb R^d \times [0,1]</math>: <math display="block">\mathrm d \mathbf Z_t = \mathbf v(\mathbf Z_t , t) \, \mathrm dt, \quad t \in [0,1], \quad \text{starting from } \mathbf Z_0 \sim \pi_0</math> such that <math>\mathbf Z_1 \sim \pi_1</math> when following the ODE starting from <math>\mathbf Z_0 \sim \pi_0</math>. Generally, for any time-differentiable process <math>\mathbf X_t</math>, <math>\mathbf v</math> can be estimated by solving: <math display="block">\min_{\mathbf v} \int_0^1 \mathbb{E}\left [\lVert{\dot{\mathbf X}_t - \mathbf v(\mathbf X_t, t)}\rVert^2\right] \,\mathrm{d}t.</math>
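The minimizer of this least-squares problem is, at each point, the conditional average of the instantaneous velocity of <math>\mathbf X_t</math>, which is what makes the learned drift preserve the marginal distributions of <math>\mathbf X_t</math>: <math display="block">\mathbf v^*(\mathbf x, t) = \mathbb{E}\left[\dot{\mathbf X}_t \mid \mathbf X_t = \mathbf x\right].</math>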
 
By injecting the strong prior that intermediate trajectories are straight, rectified flow achieves both theoretical relevance to optimal transport<ref>{{Citation |last=Liu |first=Qiang |title=Rectified Flow: A Marginal Preserving Approach to Optimal Transport |date=2022-09-29 |url=http://arxiv.org/abs/2209.14577 |access-date=2024-03-22 |doi=10.48550/arXiv.2209.14577}}</ref> and computational efficiency, as ODEs with straight paths can be simulated precisely without time discretization.
Given a probabilistic coupling over pairs <math>(x_0, x_1)</math>, we can train a velocity field <math>v_\theta</math> on the space of all images by minimizing the expectation of <math>\int_0^1 \|(x_1 - x_0 ) - v_\theta(x_t, t) \|^2 dt</math>, where <math>x_t = t x_1 + (1-t) x_0</math> is the linear interpolation between the two endpoints. Intuitively, the velocity field tries to guide each noisy image toward a natural-looking image along a path that is as straight as possible; at points where the outcome is ambiguous, it instead points toward the average of the several natural-looking images the path might end up at.
 
Specifically, rectified flow seeks to match an ODE with the marginal distributions of the '''linear interpolation''' between points from distributions <math>\pi_0</math> and <math>\pi_1</math>.<ref name=":70">{{Citation |last=Liu |first=Xingchao |title=Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow |date=2022-09-07 |url=http://arxiv.org/abs/2209.03003 |access-date=2024-03-22 |doi=10.48550/arXiv.2209.03003 |last2=Gong |first2=Chengyue |last3=Liu |first3=Qiang}}</ref><ref name=":8">{{Cite web |title=Rectified Flow |url=https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html |access-date=2024-03-06 |website=www.cs.utexas.edu}}</ref> Given observations <math>\mathbf{X}_0 \sim \pi_0</math> and <math>\mathbf{X}_1 \sim \pi_1</math>, the canonical linear interpolation <math>\mathbf{X}_t = t\mathbf{X}_1 + (1-t)\mathbf{X}_0,\ t\in[0,1]</math> yields the trivial case <math>\dot{\mathbf X}_t = \mathbf X_1 - \mathbf X_0</math>, which cannot be causally simulated without knowing <math>\mathbf{X}_1</math> in advance. To address this, <math>\mathbf{X}_t</math> is "projected" into a space of causally simulatable ODEs, expressed as <math>\mathrm{d}\mathbf{Z}_t = \mathbf{v}(\mathbf{Z}_t, t)\,\mathrm{d}t</math>, by minimizing the least squares loss with respect to the direction <math>\mathbf{X}_1 - \mathbf{X}_0</math>: <math display="block">\min_{\mathbf v} \int_0^1 \mathbb{E}\left [\lVert{(\mathbf X_1-\mathbf X_0) - \mathbf v(\mathbf X_t, t)}\rVert^2\right] \,\mathrm{d}t.</math>
 
The data pair <math>(\mathbf{X}_0, \mathbf{X}_1)</math> can be any coupling of <math>\pi_0</math> and <math>\pi_1</math>, typically an independent coupling (i.e., <math>(\mathbf{X}_0,\mathbf{X}_1) \sim \pi_0 \times \pi_1</math>) obtained by randomly pairing observations from <math>\pi_0</math> and <math>\pi_1</math>. This process ensures that the <math>\mathbf{Z}_t</math> trajectories closely mirror the density map of the <math>\mathbf{X}_t</math> trajectories, but ''reroute'' at intersections to ensure causality. This rectifying process is also referred to as Flow Matching, Stochastic Interpolation, and Alpha-Blending.
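As an illustration, one training step for this least-squares objective with an independent coupling could look as follows. This is a sketch only; the network <code>v_theta</code>, its call signature <code>v_theta(x_t, t)</code>, and the batch shapes are assumptions made for the example rather than details from the cited papers.

<syntaxhighlight lang="python">
import torch

def rectified_flow_loss(v_theta, x0, x1):
    """Monte Carlo estimate of the rectified flow / flow matching loss.

    x0: batch sampled from pi_0, shape (B, ...).
    x1: batch sampled from pi_1, paired independently with x0.
    v_theta: network predicting a velocity, called as v_theta(x_t, t).
    """
    b = x0.shape[0]
    # t ~ Uniform(0, 1), shaped (B, 1, ..., 1) so it broadcasts over x0/x1
    t = torch.rand(b, *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = t * x1 + (1.0 - t) * x0        # linear interpolation X_t
    target = x1 - x0                     # dX_t/dt along the straight path
    pred = v_theta(x_t, t.reshape(b))    # predicted velocity at (x_t, t)
    return torch.mean((target - pred) ** 2)
</syntaxhighlight>

The loss is then minimized over the parameters of <code>v_theta</code> with a standard stochastic gradient optimizer.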
 
A distinctive aspect of rectified flow is its capability for "'''reflow'''", which straightens the trajectories of the ODE paths. Denote the rectified flow <math>\boldsymbol{Z}^0 = \{\mathbf{Z}_t: t\in[0,1]\}</math> induced from <math>(\mathbf{X}_0,\mathbf{X}_1)</math> as <math>\boldsymbol{Z}^0 = \mathsf{Rectflow}((\mathbf{X}_0,\mathbf{X}_1))</math>. Recursively applying the <math>\mathsf{Rectflow}(\cdot)</math> operator generates a series of rectified flows <math>\boldsymbol{Z}^{k+1} = \mathsf{Rectflow}((\mathbf{Z}_0^k, \mathbf{Z}_1^k))</math>, starting with <math>(\mathbf{Z}_0^0,\mathbf{Z}_1^0)=(\mathbf{X}_0,\mathbf{X}_1)</math>, where <math>\boldsymbol{Z}^k</math> is the <math>k</math>-th rectified flow induced from <math>(\mathbf{X}_0,\mathbf{X}_1)</math>. Each reflow step not only reduces the transport cost but also straightens the paths, so the paths of <math>\boldsymbol{Z}^k</math> become straighter as <math>k</math> increases.
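A sketch of the reflow recursion under these definitions, reusing the hypothetical helpers from the earlier sketches (<code>integrate_ode</code>, plus an assumed <code>train_velocity_field</code> that fits a velocity field to a given coupling; neither is code from the cited work):

<syntaxhighlight lang="python">
def reflow(x0_samples, x1_samples, train_velocity_field, integrate_ode, num_rounds=2):
    """Iterate Z^{k+1} = Rectflow((Z_0^k, Z_1^k)), starting from (X_0, X_1).

    Each round fits a velocity field to the current coupling, then builds the
    next coupling by transporting z0 along the learned ODE.
    """
    z0, z1 = x0_samples, x1_samples
    flows = []
    for _ in range(num_rounds):
        v = train_velocity_field(z0, z1)  # regress v onto the directions z1 - z0
        z1 = integrate_ode(z0, v)         # new endpoints Z_1^{k+1} from the ODE
        flows.append(v)                   # paths of later flows are straighter
    return flows
</syntaxhighlight>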
 
Rectified flow also admits a nonlinear extension in which the linear interpolation <math>\mathbf{X}_t</math> is replaced by any time-differentiable curve connecting <math>\mathbf{X}_0</math> and <math>\mathbf{X}_1</math>, given by <math>\mathbf{X}_t = \alpha_t \mathbf{X}_1 + \beta_t \mathbf{X}_0</math>. This framework encompasses DDIM and probability flow ODEs as special cases, corresponding to particular choices of <math>\alpha_t</math> and <math>\beta_t</math>. However, when the path of <math>\mathbf{X}_t</math> is not straight, the pair <math>(\mathbf{Z}_0, \mathbf{Z}_1)</math> no longer guarantees a reduction in convex transport costs, and the reflow process no longer straightens the paths of <math>\mathbf{Z}_t</math>.<ref name=":0" />
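For concreteness, the interpolation and the regression target it induces can be written for two schedules: the linear (rectified flow) case and, as one illustrative curved choice that is not prescribed by the source, a trigonometric schedule:

<syntaxhighlight lang="python">
import numpy as np

def interpolant(x0, x1, t, schedule="linear"):
    """Return X_t = alpha_t * x1 + beta_t * x0 and its time derivative dX_t/dt."""
    if schedule == "linear":          # rectified flow: alpha_t = t, beta_t = 1 - t
        a, b = t, 1.0 - t
        da, db = 1.0, -1.0
    elif schedule == "trig":          # example curved schedule (illustrative only)
        a, b = np.sin(0.5 * np.pi * t), np.cos(0.5 * np.pi * t)
        da = 0.5 * np.pi * np.cos(0.5 * np.pi * t)
        db = -0.5 * np.pi * np.sin(0.5 * np.pi * t)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    x_t = a * x1 + b * x0
    dx_t = da * x1 + db * x0          # target direction for the drift regression
    return x_t, dx_t
</syntaxhighlight>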
 
Flows with nearly straight paths offer a significant computational benefit by minimizing time-discretization errors in numerical simulation. Specifically, if an ODE <math>\mathrm{d} \mathbf{Z}_t = \mathbf{v}(\mathbf{Z}_t,t)\; \mathrm{d}t</math> follows perfectly straight paths, it simplifies to <math>\mathbf{Z}_t = \mathbf{Z}_0 + t \cdot \mathbf{v}(\mathbf{Z}_0, 0)</math>, allowing for exact solutions with a ''single Euler step''. This addresses the key bottleneck of slow inference in ODE/SDE-based models. Consequently, the reflow/straightening procedure offers a way to train one-step generative models, comparable in sampling cost to GANs and VAEs, using ODEs as an intermediate mechanism.
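In code, sampling from a perfectly straightened flow reduces to one application of the velocity field (a sketch reusing the hypothetical <code>v</code> interface from above):

<syntaxhighlight lang="python">
def one_step_sample(z0, v):
    """Exact ODE solution when the paths are perfectly straight:
    Z_t = Z_0 + t * v(Z_0, 0), evaluated at t = 1."""
    return z0 + v(z0, 0.0)
</syntaxhighlight>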
 
== Choice of architecture ==