Flow-based generative model: Difference between revisions

== Method ==
 
[[File:Normalizing-flow.svg|thumb|Scheme for normalizing flows]]
 
Let <math>z_0</math> be a (possibly multivariate) [[random variable]] with distribution <math>p_0(z_0)</math>.
 
For <math>i = 1, ..., K</math>, let <math>z_i = f_i(z_{i-1})</math> be a sequence of random variables transformed from <math>z_0</math>. The functions <math>f_1, ..., f_K</math> should be invertible, i.e. the [[inverse function]] <math>f^{-1}_i</math> exists. The final output <math>z_K</math> models the target distribution.
 
 
The log likelihood of <math>z_K</math> is (see [[#Derivation of log likelihood|derivation]]):
 
: <math>\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left|\det \frac{df_i(z_{i-1})}{dz_{i-1}}\right|</math>
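This change-of-variables formula can be checked numerically. The following minimal sketch (illustrative only; the maps and the evaluation point are arbitrary choices, not from any reference implementation) composes two affine maps, applies the formula, and compares against the exact density of the resulting Gaussian:

```python
import numpy as np

# Two invertible maps f1(z) = 2z + 1 and f2(z) = 3z - 5, so the
# composition is z2 = 6*z0 - 2, and z0 ~ N(0,1) implies z2 ~ N(-2, 6**2).
def log_p0(z):
    # standard normal log-density
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

z0 = 0.7
z2 = 3 * (2 * z0 + 1) - 5

# formula above: log p_K(z_K) = log p0(z0) - sum of log|det Jacobians|
log_pK = log_p0(z0) - (np.log(2.0) + np.log(3.0))

# exact N(-2, 36) log-density at z2, for comparison
exact = -0.5 * ((z2 + 2) / 6)**2 - np.log(6.0) - 0.5 * np.log(2 * np.pi)
assert np.isclose(log_pK, exact)
```

The two log-densities agree to machine precision, since both affine Jacobians are constants.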
 
 
Learning probability distributions by differentiating log Jacobians originated in the Infomax (maximum likelihood) approach to ICA,<ref>Bell, A. J.; Sejnowski, T. J. (1995). "[https://doi.org/10.1162/neco.1995.7.6.1129 An information-maximization approach to blind separation and blind deconvolution]". ''Neural Computation''. '''7''' (6): 1129–1159. doi:10.1162/neco.1995.7.6.1129.</ref> which forms a single-layer (<math>K=1</math>) flow-based model. Relatedly, a single-layer precursor of conditional generative flows appeared in the work of Roth and Baram (1996).<ref>Roth, Z.; Baram, Y. (1996). "[https://doi.org/10.1109/72.536322 Multidimensional density shaping by sigmoids]". ''IEEE Transactions on Neural Networks''. '''7''' (5): 1291–1298. doi:10.1109/72.536322.</ref>
 
To efficiently compute the log likelihood, the functions <math>f_1, ..., f_K</math> should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions <math>f_1, ..., f_K</math> are modeled using [[Deep learning|deep neural networks]], and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,<ref name=":1">{{cite arXiv | eprint=1410.8516| last1=Dinh| first1=Laurent| last2=Krueger| first2=David| last3=Bengio| first3=Yoshua| title=NICE: Non-linear Independent Components Estimation| year=2014| class=cs.LG}}</ref> RealNVP,<ref name=":2">{{cite arXiv | eprint=1605.08803| last1=Dinh| first1=Laurent| last2=Sohl-Dickstein| first2=Jascha| last3=Bengio| first3=Samy| title=Density estimation using Real NVP| year=2016| class=cs.LG}}</ref> and Glow.<ref name="glow">{{cite arXiv | eprint=1807.03039| last1=Kingma| first1=Diederik P.| last2=Dhariwal| first2=Prafulla| title=Glow: Generative Flow with Invertible 1x1 Convolutions| year=2018| class=stat.ML}}</ref>
In other words, minimizing the [[Kullback–Leibler divergence]] between the model's likelihood and the target distribution is equivalent to [[Maximum likelihood estimation|maximizing the model likelihood]] under observed samples of the target distribution.<ref>{{Cite journal |last1=Papamakarios |first1=George |last2=Nalisnick |first2=Eric |last3=Rezende |first3=Danilo Jimenez |last4=Mohamed |first4=Shakir |last5=Lakshminarayanan |first5=Balaji |date=March 2021 |title=Normalizing Flows for Probabilistic Modeling and Inference |journal=Journal of Machine Learning Research |url=https://jmlr.org/papers/v22/19-1028.html |volume=22 |issue=57 |pages=1–64 |arxiv=1912.02762}}</ref>
 
Pseudocode for training normalizing flows is as follows:<ref>{{Cite journal |last1=Kobyzev |first1=Ivan |last2=Prince |first2=Simon J.D. |last3=Brubaker |first3=Marcus A. |date=November 2021 |title=Normalizing Flows: An Introduction and Review of Current Methods |url=https://ieeexplore.ieee.org/document/9089305 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=43 |issue=11 |pages=3964–3979 |doi=10.1109/TPAMI.2020.2992934 |pmid=32396070 |arxiv=1908.09257 |bibcode=2021ITPAM..43.3964K |s2cid=208910764 |issn=1939-3539}}</ref>
 
* INPUT. dataset <math>x_{1:n}</math>, normalizing flow model <math>f_\theta(\cdot), p_0 </math>.
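The training procedure can be sketched concretely. Below is a minimal illustrative example, not the cited algorithm: maximum-likelihood training of a single-layer (<math>K=1</math>) affine flow <math>x = az_0+b</math> with a standard-normal base, using hand-derived gradients of the negative log-likelihood; all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=4096)      # samples from the target

# Affine flow x = f(z0) = a*z0 + b with standard-normal base p0, so
# log p(x) = log p0((x - b)/a) - log|a|   (the formula above with K=1).
a, b, lr = 1.0, 0.0, 0.05
for _ in range(2000):
    z0 = (data - b) / a
    # gradients of the mean negative log-likelihood
    # NLL = mean(z0**2)/2 + log|a| + const
    grad_a = (1.0 - np.mean(z0**2)) / a
    grad_b = -np.mean(z0) / a
    a -= lr * grad_a
    b -= lr * grad_b
# after training, a ≈ 2 (target scale) and b ≈ 3 (target ___location)
```

Gradient descent drives the parameters to the maximum-likelihood fit, here the sample mean and standard deviation of the data.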
As running examples of manifolds with smooth, isometric embedding in <math>\R^n</math> we shall use:
* The [[n-sphere|unit hypersphere]]: <math>\mathbb S^{n-1}=\{\mathbf x\in\R^n:\mathbf x'\mathbf x=1\}</math>, where flows can be used to generalize e.g. [[Von Mises-Fisher distribution|Von Mises-Fisher]] or uniform spherical distributions.
* The [[simplex]] interior: <math>\Delta^{n-1}=\{\mathbf p=(p_1,\dots,p_n)\in\R^n:p_i>0, \sum_ip_i=1\}</math>, where <math>n</math>-way [[categorical distribution|categorical distributions]] live; and where flows can be used to generalize e.g. [[Dirichlet distribution|Dirichlet]] or uniform simplex distributions.
 
As a first example of a spherical manifold flow transform, consider the [[ACG distribution#ACG via transformation of normal or uniform variates|normalized linear transform]], which radially projects onto the unitsphere the output of an invertible linear transform, parametrized by the <math>n\text{-by-}n</math> invertible matrix <math>\mathbf M</math>:
:<math>
f_\text{lin}(\mathbf x;\mathbf M)=\frac{\mathbf{Mx}}{\lVert\mathbf{Mx}\rVert}
</math>
In full Euclidean space, <math>f_\text{lin}:\R^n\to\R^n</math> is ''not'' invertible, but if we restrict the domain and codomain to the unitsphere, then <math>f_\text{lin}:\mathbb S^{n-1}\to\mathbb S^{n-1}</math> ''is'' invertible (more specifically, it is a [[bijection]], a [[homeomorphism]], and a [[diffeomorphism]]), with inverse <math>f_\text{lin}(\cdot\,;\mathbf M^{-1})</math>. The Jacobian of <math>f_\text{lin}:\R^n\to\R^n</math> at <math>\mathbf y=f_\text{lin}(\mathbf x;\mathbf M)</math> is <math>\lVert\mathbf{Mx}\rVert^{-1}(\mathbf I_n -\mathbf{yy}')\mathbf M</math>, which has rank <math>n-1</math> and determinant zero; while [[Projected normal distribution#Wider application of the normalized linear transform|as explained here]], the factor (see subsection below) relating source and transformed densities is: <math>\lVert\mathbf{Mx}\rVert^{-n}\left|\operatorname{det}\mathbf M\right|</math>.
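This density factor can be checked numerically against the tangent-space volume ratio <math>\left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|</math> derived below for curved manifolds. A NumPy sketch (the randomly drawn <math>\mathbf M</math> is almost surely invertible; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))                 # almost surely invertible
x = rng.normal(size=n)
x /= np.linalg.norm(x)                      # a point on the unitsphere

Mx = M @ x
y = Mx / np.linalg.norm(Mx)                 # y = f_lin(x; M)

# Euclidean Jacobian of f_lin at x (rank n-1, determinant zero)
F = (np.eye(n) - np.outer(y, y)) @ M / np.linalg.norm(Mx)

def tangent_basis(v):
    # orthonormal basis of the sphere's tangent space at unit vector v:
    # the right singular vectors of v' beyond the first span its null space
    return np.linalg.svd(v.reshape(1, -1))[2][1:].T

Tx, Ty = tangent_basis(x), tangent_basis(y)
R_f = abs(np.linalg.det(Ty.T @ F @ Tx))
factor = np.linalg.norm(Mx)**(-n) * abs(np.linalg.det(M))
assert np.isclose(R_f, factor)
```

Both quantities agree, even though the full Euclidean Jacobian itself is singular.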
 
=== Differential volume ratio ===
:<math>
P_X(\mathbf x)\operatorname{volume}(U)\approx P_Y(\mathbf y)\operatorname{volume}(V)
</math>
where volume (for very small regions) is given by [[Lebesgue measure]] in <math>m</math>-dimensional [[tangent space]]. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, which we term the '''differential volume ratio'''.
 
To obtain concrete formulas for volume on the <math>m</math>-dimensional manifold, we construct <math>U</math> by mapping an <math>m</math>-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function: <math>\R^m\to\R^n</math>. At very small scale, the embedding function becomes essentially linear, so that <math>U</math> is a [[Parallelepiped#Parallelotope|parallelotope]] (multidimensional generalization of a parallelogram). Similarly, the flow transform, <math>f</math>, becomes linear, so that the image, <math>V=f(U)</math>, is also a parallelotope. In <math>\R^m</math>, we can represent an <math>m</math>-dimensional parallelotope with an <math>m\text{-by-}m</math> matrix whose column vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The [[Determinant#Volume and Jacobian determinant|volume is given by the absolute value of the determinant]] of this matrix. If more generally (as is the case here), an <math>m</math>-dimensional parallelotope is embedded in <math>\R^n</math>, it can be represented with a (tall) <math>n\text{-by-}m</math> matrix, say <math>\mathbf V</math>. Denoting the parallelotope as <math>/\mathbf V\!/</math>, its volume is then given by the square root of the [[Gram determinant]]:
:<math>
\operatorname{volume}/\mathbf V\!/=\sqrt{\left|\operatorname{det}(\mathbf V'\mathbf V)\right|}
</math>
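As a quick numerical sanity check of the Gram-determinant formula, consider a 2-dimensional parallelotope embedded in <math>\R^3</math>, where the same area is also given by the norm of a cross product (the edge vectors below are arbitrary illustrative choices):

```python
import numpy as np

# 2-dimensional parallelotope embedded in R^3, spanned by the columns of V
V = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [0.0, 1.0]])
vol = np.sqrt(abs(np.linalg.det(V.T @ V)))   # sqrt of the Gram determinant

# for m = 2, n = 3 the same area is the norm of the cross product of edges
assert np.isclose(vol, np.linalg.norm(np.cross(V[:, 0], V[:, 1])))
```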
:<math>e:\tilde\mathbf p=(p_1,\dots,p_{n-1})\mapsto\mathbf p=(p_1,\dots,p_{n-1},1-\sum_{i=1}^{n-1}p_i)
</math>
which maps a conveniently chosen <math>(n-1)</math>-dimensional representation, <math>\tilde\mathbf p</math>, to the embedded manifold. The <math>n\text{-by-}(n-1)</math> Jacobian is
<math>\mathbf E = \begin{bmatrix}
\mathbf{I}_{n-1} \\
-\boldsymbol1'
\end{bmatrix}
</math>
where <math>\boldsymbol 1</math> is a column of <math>n-1</math> ones. The differential volume element at <math>\tilde\mathbf p</math> is the parallelotope <math>U=/\mathbf{ED}\!/</math>, where <math>\mathbf D</math> is the diagonal matrix of coordinate differentials, <math>dp_1,\dots,dp_{n-1}</math>, with volume:
:<math>
\operatorname{volume}(U)=\sqrt{\left|\operatorname{det}(\mathbf D'\mathbf E'\mathbf{ED})\right|}
=\sqrt n\prod_{i=1}^{n-1} \left|dp_i\right|
</math>
To understand the geometric interpretation of the factor <math>\sqrt{n}</math>, see the example for the 1-simplex in the diagram at right.
 
The differential volume element at the transformation output (<math>\mathbf q\in\Delta^{n-1}</math>), is the parallelotope, <math>V=f(U)=/\mathbf{F_pED}\!/</math>, where <math>\mathbf{F_p}</math> is the <math>n\text{-by-}n</math> Jacobian of <math>f</math> at <math>\mathbf p=e(\tilde\mathbf p)</math>. Its volume is:
:<math>
\operatorname{volume}(V)=\sqrt{\left|\operatorname{det}(\mathbf D'\mathbf E'\mathbf F_\mathbf p'\mathbf{F_pED})\right|}
=\sqrt{\left|\operatorname{det}(\mathbf E'\mathbf F_\mathbf p'\mathbf{F_pE})\right|}\,
\left|\operatorname{det}\mathbf D\right|
</math>
so that the factor <math>\left|\operatorname{det}\mathbf D\right|</math> cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the '''representation function''', <math>r:\mathbf p\mapsto\tilde\mathbf p</math>, which simply extracts the first <math>n-1</math> components. Its Jacobian is <math>\mathbf R=\begin{bmatrix}\mathbf I_{n-1}&\boldsymbol0\end{bmatrix}</math>. Observe that, since <math>e\circ r\circ f=f</math>, the [[Chain rule#General rule: Vector-valued functions with multiple inputs|chain rule for function composition]] gives: <math>\mathbf{ERF_p}=\mathbf{F_p}</math>. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor <math>\sqrt{\left|\operatorname{det}(\mathbf E'\mathbf E)\right|}=\sqrt n</math>, which also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, <math>r\circ f\circ e</math>:
:<math>
R^\Delta_f(\mathbf p)=\frac{\operatorname{volume}(V)}{\operatorname{volume}(U)}
=\left|\operatorname{det}(\mathbf{RF_pE})\right|
</math>
 
====Simplex calibration transform====
A [[Dirichlet distribution#Generalization by scaling and translation of log-probabilities|calibration transform]], <math>f_\text{cal}:\Delta^{n-1}\to\Delta^{n-1}</math>, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic <math>n</math>-class classifier,<ref>{{Cite conference|last1=Brümmer
|first1=Niko
|last2=van Leeuwen
}}</ref> sends <math>\mathbf p</math> to <math>\mathbf q\propto(p_1^ae^{c_1},\dots,p_n^ae^{c_n})</math>, with scalar parameter <math>a</math> and <math>\mathbf c\in\R^n</math>, that is:
:<math>
f_\text{cal}(\mathbf p\,;a,\mathbf c)=\operatorname{softmax}(a\log\mathbf p+\mathbf c)\,.
</math>
Its differential volume ratio is:
:<math>
R^\Delta_{f_\text{cal}}(\mathbf p)=a^{n-1}\prod_{i=1}^n\frac{q_i}{p_i}
</math>
* This result can also be obtained by factoring the density of the [[SGB distribution]],<ref name="sgb">{{cite web |last1=Graf |first1=Monique (2019)|title=The Simplicial Generalized Beta distribution - R-package SGB and applications |url=https://libra.unine.ch/server/api/core/bitstreams/dd593778-b1fd-4856-855b-7b21e005ee77/content |website=Libra |access-date=26 May 2025}}</ref> which is obtained by sending [[Dirichlet distribution|Dirichlet]] variates through <math>f_\text{cal}</math>.
While calibration transforms are most often trained as [[discriminative model|discriminative models]], the reinterpretation here as a probabilistic flow also allows the design of [[generative model|generative]] calibration models based on this transform. When used for calibration, the restriction <math>a>0</math> can be imposed to prevent direction reversal in log-probability space. With the additional restriction <math>\mathbf c=\boldsymbol0</math>, this transform (with discriminative training) is known in machine learning as [[Platt scaling#Analysis|temperature scaling]].
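The closed-form differential volume ratio of the calibration transform, <math>a^{n-1}\prod_i q_i/p_i</math> with <math>\mathbf q=f_\text{cal}(\mathbf p)</math> (obtainable from the SGB density mentioned above), can be verified numerically against a finite-difference Jacobian of the sandwiched map <math>r\circ f_\text{cal}\circ e</math>. A NumPy sketch, assuming the parametrization <math>q_i\propto p_i^ae^{c_i}</math> (the test point and parameters are arbitrary):

```python
import numpy as np

def f_cal(p, a, c):
    # q proportional to p_i^a * exp(c_i), i.e. softmax(a*log(p) + c)
    z = a * np.log(p) + c
    z -= z.max()                     # numerical stability
    q = np.exp(z)
    return q / q.sum()

def sandwiched(pt, a, c):
    # r∘f_cal∘e : (n-1) simplex coordinates -> (n-1) coordinates
    p = np.append(pt, 1.0 - pt.sum())
    return f_cal(p, a, c)[:-1]

def volume_ratio(p, a, c, eps=1e-6):
    # |det| of the central finite-difference Jacobian of the sandwiched map
    m = p.size - 1
    J = np.empty((m, m))
    for j in range(m):
        d = np.zeros(m); d[j] = eps
        J[:, j] = (sandwiched(p[:-1] + d, a, c)
                   - sandwiched(p[:-1] - d, a, c)) / (2 * eps)
    return abs(np.linalg.det(J))

p = np.array([0.2, 0.3, 0.5])
a, c = 2.0, np.array([0.1, -0.2, 0.3])
q = f_cal(p, a, c)
closed_form = a**(p.size - 1) * np.prod(q / p)
assert np.isclose(volume_ratio(p, a, c), closed_form, rtol=1e-4)
```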
 
====Generalized calibration transform====
If <math>f_\text{gcal}</math> is to be used as a calibration transform, further constraint could be imposed, for example that <math>\mathbf A</math> be [[positive definite matrix|positive definite]], so that <math>(\mathbf{Ax})'\mathbf x>0</math>, which avoids direction reversals. (This is one possible generalization of <math>a>0</math> in the <math>f_\text{cal}</math> parameter.)
 
For <math>n=2</math>, with <math>a>0</math> and <math>\mathbf A</math> positive definite, <math>f_\text{cal}</math> and <math>f_\text{gcal}</math> are equivalent in the sense that in both cases <math>\log\frac{p_1}{p_2}\mapsto\log\frac{q_1}{q_2}</math> is a straight line, whose (positive) slope and offset are functions of the transform parameters. For <math>n>2</math>, <math>f_\text{gcal}</math> ''does'' generalize <math>f_\text{cal}</math>.
 
It must however be noted that chaining multiple <math>f_\text{gcal}</math> flow transformations does ''not'' give a further generalization, because:
:<math>
f_\text{gcal}(\cdot\,;\mathbf A_1,\mathbf c_1) \circ
f_\text{gcal}(\cdot\,;\mathbf A_2,\mathbf c_2)
=f_\text{gcal}(\cdot\,;\mathbf A_1\mathbf A_2,\,\mathbf A_1\mathbf c_2+\mathbf c_1)
</math>
 
===Differential volume ratio for curved manifolds===
Consider a flow, <math>\mathbf y=f(\mathbf x)</math>, on a curved manifold, for example <math>\mathbb S^{n-1}</math>, which we equip with the embedding function, <math>e</math>, that maps a set of <math>n-1</math> [[N-sphere#Spherical coordinates|angular spherical coordinates]] to <math>\mathbb S^{n-1}</math>. The Jacobian of <math>e</math> is non-constant and we have to evaluate it at both input (<math>\mathbf {E_x}</math>) and output (<math>\mathbf {E_y}</math>). The same applies to <math>r</math>, the representation function that recovers spherical coordinates from points on <math>\mathbb S^{n-1}</math>, for which we need the Jacobian at the output (<math>\mathbf{R_y}</math>). The differential volume ratio now generalizes to:
:<math>
R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{R_yF_xE_x})\right|\,\frac{\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf y'\mathbf{E_y})\right|}}{\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf x'\mathbf{E_x})\right|}}
</math>
For geometric insight, consider <math>\mathbb S^2</math>, where the spherical coordinates are co-latitude, <math>\theta\in[0,\pi]</math>, and longitude, <math>\phi\in[0,2\pi)</math>. At <math>\mathbf x = e(\theta,\phi)</math>, we get <math>\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf x'\mathbf{E_x})\right|}=\sin\theta</math>, which gives the radius of the circle at that latitude (compare e.g. a polar circle to the equator). The differential volume (surface area on the sphere) is: <math>\sin\theta\,d\theta\,d\phi</math>.
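The factor <math>\sin\theta</math> can be confirmed with a finite-difference Jacobian of the spherical embedding (a NumPy sketch; the evaluation point is arbitrary):

```python
import numpy as np

def e(theta, phi):
    # co-latitude/longitude embedding of S^2 in R^3
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def embedding_jacobian(theta, phi, eps=1e-6):
    # central finite differences in each of the two coordinates
    E = np.empty((3, 2))
    E[:, 0] = (e(theta + eps, phi) - e(theta - eps, phi)) / (2 * eps)
    E[:, 1] = (e(theta, phi + eps) - e(theta, phi - eps)) / (2 * eps)
    return E

theta, phi = 0.9, 2.0
E = embedding_jacobian(theta, phi)
gram_vol = np.sqrt(abs(np.linalg.det(E.T @ E)))   # should equal sin(theta)
assert np.isclose(gram_vol, np.sin(theta), atol=1e-5)
```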
 
The above derivation for <math>R_f</math> is fragile in the sense that when using ''fixed'' functions <math>e,r</math>, there may be places where they are not well-defined, for example at the poles of the 2-sphere where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing to ''local'' coordinates (charts), where in the vicinities of <math>\mathbf x,\mathbf y\in\mathcal M</math>, we map from local <math>m</math>-dimensional coordinates to <math>\R^n</math> and back using the respective function pairs <math>e_{\mathbf x}, r_{\mathbf x}</math> and <math>e_{\mathbf y}, r_{\mathbf y}</math>. We continue to use the same notation for the Jacobians of these functions (<math>\mathbf{E_x}, \mathbf{E_y}, \mathbf{R_y}</math>), so that the above formula for <math>R_f</math> remains valid.
 
We ''can'' however, choose our local coordinate system in a way that simplifies the expression for <math>R_f</math> and indeed also its practical implementation.<ref name=manifold_flow/> Let <math>\pi:\mathcal P\to\R^n</math> be a smooth idempotent projection (<math>\pi\circ\pi=\pi</math>) from the ''projectible set'', <math>\mathcal P\subseteq\R^n</math>, onto the embedded manifold. For example:
* The positive orthant of <math>\R^n</math> is projected onto the '''simplex''' as: <math>\pi(\mathbf z)=\bigl(\sum_{i=1}^n z_i\bigr)^{-1}\mathbf z</math>
* Non-zero vectors in <math>\R^n</math> are projected onto the '''unitsphere''' as: <math>\pi(\mathbf z)=\bigl(\sum_{i=1}^n z^2_i\bigr)^{-\frac12}\mathbf z</math>
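The Jacobians of these two projections, which at a point on the manifold are the orthogonal projection <math>\mathbf I_n-\mathbf{xx}'</math> (sphere) and the oblique projection <math>\mathbf I_n-\boldsymbol{x1}'</math> (simplex), can be checked by finite differences (a NumPy sketch with arbitrary test points):

```python
import numpy as np

def num_jac(g, z, eps=1e-6):
    # central finite-difference Jacobian of g at z
    n = z.size
    J = np.empty((n, n))
    for j in range(n):
        d = np.zeros(n); d[j] = eps
        J[:, j] = (g(z + d) - g(z - d)) / (2 * eps)
    return J

x = np.array([0.6, 0.8, 0.0])                 # unit vector on the sphere
Pi_sphere = num_jac(lambda z: z / np.linalg.norm(z), x)
assert np.allclose(Pi_sphere, np.eye(3) - np.outer(x, x), atol=1e-5)

p = np.array([0.2, 0.3, 0.5])                 # point on the simplex
Pi_simplex = num_jac(lambda z: z / z.sum(), p)
assert np.allclose(Pi_simplex, np.eye(3) - np.outer(p, np.ones(3)), atol=1e-5)
```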
For every <math>\mathbf x\in\mathcal M</math>, we require of <math>\pi</math> that its <math>n\text{-by-}n</math> Jacobian, <math>\boldsymbol{\Pi_x}</math>, has rank <math>m</math> (the manifold dimension), in which case <math>\boldsymbol{\Pi_x}</math> is an [[projection (linear algebra)|idempotent linear projection]] onto the local tangent space (''orthogonal'' for the unitsphere: <math>\mathbf I_n-\mathbf{xx}'</math>; ''oblique'' for the simplex: <math>\mathbf I_n-\boldsymbol{x1}'</math>). The columns of <math>\boldsymbol{\Pi_x}</math> span the <math>m</math>-dimensional tangent space at <math>\mathbf x</math>. We use the notation, <math>\mathbf{T_x}</math>, for any <math>n\text{-by-}m</math> matrix with orthonormal columns (<math>\mathbf T_{\mathbf x}'\mathbf{T_x}=\mathbf I_m</math>) that span the local tangent space. Also note: <math>\boldsymbol{\Pi_x}\mathbf{T_x}=\mathbf{T_x}</math>. We can now choose our local coordinate embedding function, <math>e_\mathbf x:\R^m\to\R^n</math>:
:<math>
e_\mathbf x(\tilde x) = \pi(\mathbf x + \mathbf{T_x\tilde x})\,,
</math>
whose Jacobian at <math>\tilde x=\boldsymbol 0</math> is <math>\mathbf{E_x}=\boldsymbol{\Pi_x}\mathbf{T_x}=\mathbf{T_x}</math>, which has orthonormal columns, so that the Gram-determinant factors equal one. With the analogous coordinate choices at <math>\mathbf y</math>, the differential volume ratio simplifies to:
:<math>
R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|
</math>
 
====Practical implementation====
For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, <math>R_f</math>, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to <math>R_f</math> itself. Practical solutions include:
*Sorrenson et al. (2023)<ref name=manifold_flow/> give a solution for computationally efficient stochastic approximation of the parameter gradient of <math>\log R_f.</math>
*For some hand-designed flow transforms, <math>R_f</math> can be analytically derived in closed form, for example the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
*On a software platform equipped with [[linear algebra]] and [[automatic differentiation]], <math>R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|</math> can be automatically evaluated, given access to only <math>\mathbf x, f, \pi</math>.<ref>With [[PyTorch]]:
<pre>
import torch
from torch.autograd.functional import jacobian

def volume_ratio(f, proj, x):
    # a sketch (illustrative helper, not a library function):
    # R_f(x) = |det(T_y' F_x T_x)|, given only x, the flow f,
    # and the idempotent projection proj (pi)
    y = f(x)
    F_x = jacobian(f, x)              # n-by-n Jacobian of the flow

    def tangent(v):
        # orthonormal basis of the column space of the rank-m
        # projection Jacobian, via the leading left singular vectors
        Pi = jacobian(proj, v)
        m = int(torch.linalg.matrix_rank(Pi))
        return torch.linalg.svd(Pi).U[:, :m]

    T_x, T_y = tangent(x), tangent(y)
    return torch.abs(torch.linalg.det(T_y.T @ F_x @ T_x))
</pre></ref>
* To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
* To derive in simple closed form the '''differential volume ratio''', <math>R_f</math>.
An interesting property of these simple spherical flows is that they don't make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.
 
==== Normalized translation flow ====