Flow-based generative model: Difference between revisions

a one word alteration for clarity
 
(97 intermediate revisions by 10 users not shown)
Line 10:
== Method ==
 
[[File:Normalizing-flow.svg|thumb|Scheme for normalizing flows]]
 
Let <math>z_0</math> be a (possibly multivariate) [[random variable]] with distribution <math>p_0(z_0)</math>.
 
For <math>i = 1, ..., K</math>, let <math>z_i = f_i(z_{i-1})</math> be a sequence of random variables transformed from <math>z_0</math>. The functions <math>f_1, ..., f_K</math> should be invertible, i.e. the [[inverse function]] <math>f^{-1}_i</math> exists. The final output <math>z_K</math> models the target distribution.
 
 
The log likelihood of <math>z_K</math> is (see [[#Derivation of log likelihood|derivation]]):
 
: <math>\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left|\det \frac{df_i(z_{i-1})}{dz_{i-1}}\right|</math>
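As a minimal numerical sketch of this formula (not taken from the cited sources), consider a hypothetical toy flow of scalar affine layers <math>f_i(z)=a_iz+b_i</math> with a standard normal base distribution. The layer parameters and the helper names <code>std_normal_logpdf</code> and <code>flow_log_likelihood</code> are illustrative assumptions. Because a composition of affine maps is again affine, the result can be compared with the closed-form Gaussian log density.

```python
import math

def std_normal_logpdf(z):
    """Log density of the standard normal base distribution p_0 (an assumed choice)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_likelihood(z_K, layers):
    """log p_K(z_K) for a chain of scalar affine layers f_i(z) = a_i * z + b_i.

    `layers` is a list of (a_i, b_i) pairs with a_i != 0 (hypothetical toy flow).
    """
    z = z_K
    log_det_sum = 0.0
    # Invert the chain from the last layer back to the first to recover z_0.
    for a, b in reversed(layers):
        z = (z - b) / a
        log_det_sum += math.log(abs(a))  # log |df_i/dz_{i-1}| of an affine map
    return std_normal_logpdf(z) - log_det_sum

layers = [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]
x = 0.7
log_p = flow_log_likelihood(x, layers)

# Closed form: the composed map is z_K = s * z_0 + mu, so z_K ~ N(mu, s^2).
s, mu = 1.0, 0.0
for a, b in layers:
    s, mu = a * s, a * mu + b
closed_form = -0.5 * (((x - mu) / s) ** 2 + math.log(2.0 * math.pi)) - math.log(abs(s))
```

In a practical flow the layers would be neural networks and the log-determinant terms would come from the architecture, but the bookkeeping is the same.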
 
 
Learning probability distributions by differentiating such log Jacobians originated in the Infomax (maximum likelihood) approach to ICA,<ref>Bell, A. J.; Sejnowski, T. J. (1995). "[https://doi.org/10.1162/neco.1995.7.6.1129 An information-maximization approach to blind separation and blind deconvolution]". ''Neural Computation''. '''7''' (6): 1129–1159. doi:10.1162/neco.1995.7.6.1129.</ref> which forms a single-layer (<math>K=1</math>) flow-based model. Relatedly, the single-layer precursor of conditional generative flows appeared in Roth and Baram (1996).<ref>Roth, Z.; Baram, Y. (1996). "[https://doi.org/10.1109/72.536322 Multidimensional density shaping by sigmoids]". ''IEEE Transactions on Neural Networks''. '''7''' (5): 1291–1298. doi:10.1109/72.536322.</ref>
 
To efficiently compute the log likelihood, the functions <math>f_1, ..., f_K</math> should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions <math>f_1, ..., f_K</math> are modeled using [[Deep learning|deep neural networks]], and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,<ref name=":1">{{cite arXiv | eprint=1410.8516| last1=Dinh| first1=Laurent| last2=Krueger| first2=David| last3=Bengio| first3=Yoshua| title=NICE: Non-linear Independent Components Estimation| year=2014| class=cs.LG}}</ref> RealNVP,<ref name=":2">{{cite arXiv | eprint=1605.08803| last1=Dinh| first1=Laurent| last2=Sohl-Dickstein| first2=Jascha| last3=Bengio| first3=Samy| title=Density estimation using Real NVP| year=2016| class=cs.LG}}</ref> and Glow.<ref name="glow">{{cite arXiv | eprint=1807.03039| last1=Kingma| first1=Diederik P.| last2=Dhariwal| first2=Prafulla| title=Glow: Generative Flow with Invertible 1x1 Convolutions| year=2018| class=stat.ML}}</ref>
Line 65 ⟶ 69:
In other words, minimizing the [[Kullback–Leibler divergence]] between the model's likelihood and the target distribution is equivalent to [[Maximum likelihood estimation|maximizing the model likelihood]] under observed samples of the target distribution.<ref>{{Cite journal |last1=Papamakarios |first1=George |last2=Nalisnick |first2=Eric |last3=Rezende |first3=Danilo Jimenez |last4=Mohamed |first4=Shakir |last5=Lakshminarayanan |first5=Balaji |date=March 2021 |title=Normalizing Flows for Probabilistic Modeling and Inference |journal=Journal of Machine Learning Research |url=https://jmlr.org/papers/v22/19-1028.html |volume=22 |issue=57 |pages=1–64 |arxiv=1912.02762}}</ref>
 
Pseudocode for training normalizing flows is as follows:<ref>{{Cite journal |last1=Kobyzev |first1=Ivan |last2=Prince |first2=Simon J.D. |last3=Brubaker |first3=Marcus A. |date=November 2021 |title=Normalizing Flows: An Introduction and Review of Current Methods |url=https://ieeexplore.ieee.org/document/9089305 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=43 |issue=11 |pages=3964–3979 |doi=10.1109/TPAMI.2020.2992934 |pmid=32396070 |arxiv=1908.09257 |bibcode=2021ITPAM..43.3964K |s2cid=208910764 |issn=1939-3539}}</ref>
 
* INPUT. dataset <math>x_{1:n}</math>, normalizing flow model <math>f_\theta(\cdot), p_0 </math>.
Line 171 ⟶ 175:
 
== Flows on manifolds ==
When a '''probabilistic flow''' transforms a distribution on an <math>m</math>-dimensional [[smooth manifold]] embedded in <math>\R^n</math>, where <math>m<n</math>, and where the transformation is specified as a function, <math>\R^n\to\R^n</math>, the scaling factor between the source and transformed [[probability density|PDFs]] is ''not'' given by the naive computation of the [[Jacobian matrix and determinant|determinant of the <math>n\text{-by-}n</math> Jacobian]] (which is zero), but instead by the determinant(s) of one or more suitably defined <math>m\text{-by-}m</math> matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al.(2023),<ref name='manifold_flow'>{{Cite arXiv
|last1=Sorrenson
|first1=Peter
Line 185 ⟶ 189:
|eprint=2312.09852
|year=2023
|class=cs.LG
}}</ref> where the more general case of non-isometrically embedded [[Riemannian manifold|Riemann manifolds]] is also treated. Here we restrict attention to [[isometry|isometrically]] embedded manifolds.
 
As running examples of manifolds with smooth, isometric embedding in <math>\R^n</math> we shall use:
* The [[n-sphere|unit hypersphere]]: <math>\mathbb S^{n-1}=\{\mathbf x\in\R^n:\mathbf x'\mathbf x=1\}</math>, where flows can be used to generalize e.g. [[Von Mises-Fisher distribution|Von Mises-Fisher]] or uniform spherical distributions.
* The [[simplex]] interior: <math>\Delta^{n-1}=\{\mathbf p=(p_1,\dots,p_n)\in\R^n:p_i>0, \sum_ip_i=1\}</math>, where <math>n</math>-way [[categorical distribution]]s live; and where flows can be used to generalize e.g. [[Dirichlet distribution|Dirichlet]], or uniform simplex distributions.
 
As a first example of a spherical manifold flow transform, consider the [[ACG distribution#ACG via transformation of normal or uniform variates|normalized linear transform]], which radially projects onto the unitsphere the output of an invertible linear transform, parametrized by the <math>n\text{-by-}n</math> invertible matrix <math>\mathbf M</math>:
Line 196 ⟶ 201:
</math>
In full Euclidean space, <math>f_\text{lin}:\R^n\to\R^n</math> is ''not'' invertible, but if we restrict the domain and co-domain to the unitsphere, then <math>f_\text{lin}:\mathbb S^{n-1}\to\mathbb S^{n-1}</math> ''is'' invertible (more specifically, it is a [[bijection]], a [[homeomorphism]] and a [[diffeomorphism]]), with inverse <math>f_\text{lin}(\cdot\,;\mathbf M^{-1})</math>. The Jacobian of <math>f_\text{lin}:\R^n\to\R^n</math>, at <math>\mathbf y=f_\text{lin}(\mathbf x;\mathbf M)</math> is <math>\lVert\mathbf{Mx}\rVert^{-1}(\mathbf I_n -\mathbf{yy}')\mathbf M</math>, which has rank <math>n-1</math> and determinant of zero; while [[Projected normal distribution#Wider application of the normalized linear transform|as explained here]], the factor (see subsection below) relating source and transformed densities is: <math>\lVert\mathbf{Mx}\rVert^{-n}\left|\operatorname{det}\mathbf M\right|</math>.
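The statements above can be checked numerically. The following is a sketch under stated assumptions rather than a reference implementation: the matrix <code>M</code> and the point <code>x</code> are arbitrary test values, and <code>tangent_basis</code> is a hypothetical helper that builds an orthonormal basis of the tangent space of the sphere from a singular value decomposition. It confirms the inverse, the rank deficiency of the <math>n\text{-by-}n</math> Jacobian, and the quoted density factor by restricting the Jacobian to the tangent spaces.

```python
import numpy as np

def f_lin(x, M):
    """Normalized linear transform: apply M, then project radially onto the unit sphere."""
    z = M @ x
    return z / np.linalg.norm(z)

def tangent_basis(x):
    """n-by-(n-1) matrix whose orthonormal columns span the tangent space of the sphere at x."""
    _, _, Vh = np.linalg.svd(x.reshape(1, -1))  # first row of Vh is parallel to x
    return Vh[1:].T

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n)) + 3.0 * np.eye(n)  # arbitrary test matrix
x = rng.normal(size=n)
x /= np.linalg.norm(x)                         # a point on S^{n-1}

y = f_lin(x, M)
x_back = f_lin(y, np.linalg.inv(M))            # the inverse uses M^{-1}

# Jacobian of f_lin as a map R^n -> R^n, using the closed form from the text.
J = (np.eye(n) - np.outer(y, y)) @ M / np.linalg.norm(M @ x)
rank_J = np.linalg.matrix_rank(J)

# Density factor from the text, and the same factor from tangent-space restriction.
factor = np.linalg.norm(M @ x) ** (-n) * abs(np.linalg.det(M))
ratio_numeric = abs(np.linalg.det(tangent_basis(y).T @ J @ tangent_basis(x)))
```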
 
=== Differential volume ratio ===
For <math>m<n</math>, let <math>\mathcal M\subset\R^n</math> be an <math>m</math>-dimensional manifold with a smooth, isometric embedding into <math>\R^n</math>. Let <math>f:\R^n\to\R^n</math> be a smooth flow transform with range restricted to <math>\mathcal M</math>. Let <math>\mathbf x\in\mathcal M</math> be sampled from a distribution with density <math>P_X</math>. Let <math>\mathbf y=f(\mathbf x)</math>, with resultant (pushforward) density <math>P_Y</math>. Let <math>U\subset\mathcal M</math> be a small, convex region containing <math>\mathbf x</math> and let <math>V=f(U)</math> be its image, which contains <math>\mathbf y</math>; then by conservation of probability mass:
:<math>
P_X(\mathbf x)\operatorname{volume}(U)\approx P_Y(\mathbf y)\operatorname{volume}(V)
</math>
where volume (for very small regions) is given by [[Lebesgue measure]] in <math>m</math>-dimensional [[tangent space]]. As the regions become infinitesimally small, the factor relating the two densities becomes the ratio of volumes, which we term the '''differential volume ratio'''.
 
To obtain concrete formulas for volume on the <math>m</math>-dimensional manifold, we construct <math>U</math> by mapping an <math>m</math>-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function: <math>\R^m\to\R^n</math>. At very small scale, the embedding function becomes essentially linear so that <math>U</math> is a [[Parallelepiped#Parallelotope|parallelotope]] (multidimensional generalization of a parallelogram). Similarly, the flow transform, <math>f</math> becomes essentially linear, so that the image, <math>V=f(U)</math> is also a parallelotope. In <math>\R^m</math>, we can represent an <math>m</math>-dimensional parallelotope with an <math>m\text{-by-}m</math> matrix whose column-vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The [[Determinant#Volume and Jacobian determinant|volume is given by the absolute value of the determinant]] of this matrix. If more generally (as is the case here), an <math>m</math>-dimensional parallelotope is embedded in <math>\R^n</math>, it can be represented with a (tall) <math>n\text{-by-}m</math> matrix, say <math>\mathbf V</math>. Denoting the parallelotope as <math>/\mathbf V\!/</math>, its volume is then given by the square root of the [[Gram determinant]]:
:<math>
\operatorname{volume}/\mathbf V\!/=\sqrt{\left|\operatorname{det}(\mathbf V'\mathbf V)\right|}
</math>.
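As an illustration of the Gram determinant formula, the sketch below computes the area of a parallelogram embedded in <math>\R^3</math> and compares it with the cross-product formula, which applies only in this special case (<math>m=2, n=3</math>). The function name <code>parallelotope_volume</code> and the example matrices are assumptions chosen for illustration.

```python
import numpy as np

def parallelotope_volume(V):
    """m-dimensional volume of the parallelotope spanned by the columns of the n-by-m matrix V."""
    return np.sqrt(abs(np.linalg.det(V.T @ V)))

# A parallelogram (m = 2) embedded in R^3, spanned by two edge vectors (columns).
V = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 2.0]])
area = parallelotope_volume(V)

# For m = 2, n = 3 the area also equals the norm of the cross product of the edges.
area_cross = np.linalg.norm(np.cross(V[:, 0], V[:, 1]))

# For a square matrix the Gram formula reduces to |det S|.
S = np.array([[2.0, 1.0],
              [0.5, 3.0]])
vol_square = parallelotope_volume(S)
```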
In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.
 
=== Simplex flow===
As a first example, we develop expressions for the differential volume ratio of a simplex flow, <math>\mathbf q=f(\mathbf p)</math>, where <math>\mathbf p, \mathbf q\in\mathcal M=\Delta^{n-1}</math>. Define the '''embedding function''':
:<math>e:\tilde\mathbf p=(p_1,\dots,p_{n-1})\mapsto\mathbf p=(p_1,\dots,p_{n-1},1-\sum_{i=1}^{n-1}p_i)
</math>
which maps a conveniently chosen, <math>m=(n-1)</math>-dimensional representation, <math>\tilde\mathbf p</math>, to the embedded manifold. The <math>n\text{-by-}(n-1)</math> Jacobian is
<math>\mathbf E = \begin{bmatrix}
\mathbf{I}_{n-1} \\
Line 216 ⟶ 226:
\end{bmatrix}
</math>.
To define the differential volume element at the transformation input (<math>\mathbf p\in\Delta^{n-1}</math>), we start with a rectangle in <math>\tilde\mathbf p</math>-space, having (signed) differential side-lengths, <math>dp_1, \dots, dp_{n-1}</math> from which we form the square diagonal matrix <math>\mathbf D</math>, the columns of which span the rectangle. At very small scale, we get <math>U=e(\mathbf D)=/\mathbf{ED}\!/</math>, with:
[[File:Simplex measure pullback.svg|frame|right|For the 1-simplex (blue) embedded in <math>\R^2</math>, when we pull back [[Lebesgue measure]] from [[tangent space]] (parallel to the simplex), via the embedding <math>p_1\mapsto(p_1,1-p_1)</math>, with Jacobian <math>\mathbf E=\begin{bmatrix}1&-1\end{bmatrix}'</math>, a scaling factor of <math>\sqrt{\mathbf E'\mathbf E}=\sqrt2</math> results.]]
:<math>
\operatorname{volume}(U) = \sqrt{\left|\operatorname{det}(\mathbf{DE}'\mathbf{ED})\right|}
= \sqrt{\left|\operatorname{det}(\mathbf E'\mathbf E)\right|}\,
\left|\operatorname{det}\mathbf D\right|
=\sqrt n\prod_{i=1}^{n-1} \left|dp_i\right|
</math>
To understand the geometric interpretation of the factor <math>\sqrt{n}</math>, see the example for the 1-simplex in the diagram at right.
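The factor <math>\sqrt n</math> can be confirmed numerically. The sketch below assumes that the Jacobian of the embedding <math>e</math> stacks the identity <math>\mathbf I_{n-1}</math> on top of a row of <math>-1</math> entries, as follows from differentiating <math>e</math>. The side lengths in <code>dp</code> are arbitrary illustrative values and the function name is hypothetical.

```python
import numpy as np

def simplex_embedding_jacobian(n):
    """n-by-(n-1) Jacobian of e: (p_1..p_{n-1}) -> (p_1..p_{n-1}, 1 - sum_i p_i)."""
    return np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])

n = 5
E = simplex_embedding_jacobian(n)

# Signed differential side lengths dp_1..dp_{n-1} (illustrative values).
dp = np.array([1e-3, -2e-3, 5e-4, 1e-3])
D = np.diag(dp)

# Gram-determinant volume of the parallelotope /ED/.
ED = E @ D
vol_U = np.sqrt(abs(np.linalg.det(ED.T @ ED)))

# Closed form from the text: sqrt(n) * prod |dp_i|.
vol_closed = np.sqrt(n) * np.prod(np.abs(dp))

# The factor sqrt(n) arises because E'E = I + 11' has determinant n.
gram_det = np.linalg.det(E.T @ E)
```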
 
The differential volume element at the transformation output (<math>\mathbf q\in\Delta^{n-1}</math>), is the parallelotope, <math>V=f(U)=/\mathbf{F_pED}\!/</math>, where <math>\mathbf{F_p}</math> is the <math>n\text{-by-}n</math> Jacobian of <math>f</math> at <math>\mathbf p=e(\tilde\mathbf p)</math>. Its volume is:
:<math>
\operatorname{volume}(V) =
Line 232 ⟶ 242:
\left|\operatorname{det}\mathbf D\right|
</math>
so that the factor <math>\left|\operatorname{det}\mathbf D\right|</math> cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the '''representation function''', <math>r:\mathbf p\mapsto\tilde\mathbf p</math>, which simply extracts the first <math>(n-1)</math> components. Its <math>(n-1)\text{-by-}n</math> Jacobian is <math>\mathbf R=\begin{bmatrix}\mathbf I_{n-1}&\boldsymbol0\end{bmatrix}</math>. Observe that, since <math>e\circ r\circ f=f</math>, the [[Chain rule#General rule: Vector-valued functions with multiple inputs|chain rule for function composition]] gives: <math>\mathbf{ERF_p}=\mathbf{F_p}</math>. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor <math>\sqrt{\left|\operatorname{det}(\mathbf E'\mathbf E)\right|}=\sqrt n</math>, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, <math>r\circ f\circ e</math>:
:<math>
R^\Delta_f(\mathbf p)=\frac{\operatorname{volume}(V)}{\operatorname{volume}(U)}
Line 242 ⟶ 252:
\mathbf p=f^{-1}(\mathbf q)
</math>
This formula is valid only because the simplex is flat and the Jacobian, <math>\mathbf E</math>, is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.
 
====Simplex calibration transform====
A '''[[Dirichlet distribution#Generalization by scaling and translation of log-probabilities|calibration transform]]''', <math>f_\text{cal}:\Delta^{n-1}\to\Delta^{n-1}</math>, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic <math>n</math>-class classifier,<ref>{{Cite conference|last1=Brümmer
|first1=Niko
|last2=van Leeuwen
|first2=D. A.
|title=On calibration of language recognition scores
|book-title=Proceedings of IEEE Odyssey: The Speaker and Language Recognition Workshop
|year=2006
|location=San Juan, Puerto Rico
|pages=1–8
|doi=10.1109/ODYSSEY.2006.248106}}
</ref><ref>{{Cite arXiv
|last1=Ferrer
|first1=Luciana
Line 253 ⟶ 273:
|eprint=2408.02841
|year=2024
|class=stat.ML
}}</ref> uses the [[softmax function]] to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For <math>\mathbf p, \mathbf q\in\Delta^{n-1}</math> and with parameters, <math>a\ne0</math> and <math>\mathbf c\in\R^n</math> the transform can be specified as:
:<math>
\mathbf q=f_\text{cal}(\mathbf p; a, \mathbf c) = \operatorname{softmax}(a^{-1}\log\mathbf p+\mathbf c)\;\iff\;
\mathbf p=f^{-1}_\text{cal}(\mathbf q; a, \mathbf c) = \operatorname{softmax}(a\log\mathbf q-a\mathbf c)
</math>
where the log is applied elementwise. After some algebra the '''differential volume ratio''' can be expressed as:
:<math>
R^\Delta_\text{cal}(\mathbf p; a, \mathbf c) = \left|\operatorname{det}(\mathbf{RF_pE})\right| = \left|a\right|^{1-n}\prod_{i=1}^n\frac{q_i}{p_i}
</math>
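The closed form can be compared with a direct numerical evaluation of <math>\left|\operatorname{det}(\mathbf{RF_pE})\right|</math>. The following is a minimal sketch, assuming that <math>f_\text{cal}</math> is extended to the positive orthant by the same softmax formula so that its <math>n\text{-by-}n</math> Jacobian can be approximated by central differences. The parameter values and the helper <code>numerical_jacobian</code> are illustrative assumptions.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def f_cal(p, a, c):
    """Calibration transform: softmax(log(p) / a + c)."""
    return softmax(np.log(p) / a + c)

def f_cal_inv(q, a, c):
    """Inverse calibration transform: softmax(a * log(q) - a * c)."""
    return softmax(a * np.log(q) - a * c)

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f at x, treating f as a map on R^n."""
    n = x.size
    J = np.empty((f(x).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return J

n, a = 4, 1.7
c = np.array([0.3, -0.2, 0.1, 0.5])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = f_cal(p, a, c)
p_back = f_cal_inv(q, a, c)

E = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])  # Jacobian of the embedding e
R = np.hstack([np.eye(n - 1), np.zeros((n - 1, 1))])  # Jacobian of the representation r
F = numerical_jacobian(lambda v: f_cal(v, a, c), p)

ratio_numeric = abs(np.linalg.det(R @ F @ E))
ratio_closed = abs(a) ** (1 - n) * np.prod(q / p)
```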
* This result can also be obtained by factoring the density of the [[SGB distribution]],<ref name="sgb">{{cite web |last1=Graf |first1=Monique (2019)|title=The Simplicial Generalized Beta distribution - R-package SGB and applications |url=https://libra.unine.ch/server/api/core/bitstreams/dd593778-b1fd-4856-855b-7b21e005ee77/content |website=Libra |access-date=26 May 2025}}</ref> which is obtained by sending [[Dirichlet distribution|Dirichlet]] variates through <math>f_\text{cal}</math>.
While calibration transforms are most often trained as [[discriminative model]]s, the reinterpretation here as a probabilistic flow also allows the design of [[generative model|generative]] calibration models based on this transform. When used for calibration, the restriction <math>a>0</math> can be imposed to prevent direction reversal in log-probability space. With the additional restriction <math>\mathbf c=\boldsymbol0</math>, this transform (with discriminative training) is known in machine learning as [[Platt scaling#Analysis|temperature scaling]].
 
====Generalized calibration transform====
The above calibration transform can be generalized to <math>f_\text{gcal}:\Delta^{n-1}\to\Delta^{n-1}</math>, with parameters <math>\mathbf c\in\R^n</math> and an invertible <math>n\text{-by-}n</math> matrix <math>\mathbf A</math>:<ref>{{Cite thesis
|last1=Brümmer
|first1=Niko
|title=Measuring, refining and calibrating speaker and language information extracted from speech
|type=PhD thesis
|institution=Department of Electrical & Electronic Engineering, University of Stellenbosch
|location=Stellenbosch, South Africa
|date=18 October 2010
|url=https://scholar.sun.ac.za/items/1b46805b-2b1e-46aa-83ce-75ede92f0159
}}</ref>
:<math>
\mathbf q = f_\text{gcal}(\mathbf p;\mathbf A,\mathbf c)
= \operatorname{softmax}(\mathbf A\log\mathbf p + \mathbf c)\,,\;\text{subject to}\;
\mathbf{A1}=\lambda\mathbf1
</math>
where the condition that <math>\mathbf A</math> has <math>\mathbf1</math> as an [[eigenvector]] ensures invertibility by sidestepping the information loss due to the invariance: <math>\operatorname{softmax}(\mathbf x+\alpha\mathbf1)=\operatorname{softmax}(\mathbf x)</math>. Note in particular that <math>\mathbf A=\lambda\mathbf I_n</math> is the ''only'' allowed diagonal parametrization, in which case we recover <math>f_\text{cal}(\mathbf p;\lambda^{-1},\mathbf c)</math>, while (for <math>n>2</math>) generalization ''is'' possible with non-diagonal matrices. The '''inverse''' is:
:<math>
\mathbf p = f_\text{gcal}^{-1}(\mathbf q;\mathbf A, \mathbf c)
= f_\text{gcal}(\mathbf q;\mathbf A^{-1}, -\mathbf A^{-1}\mathbf c)\,,\;\text{where}\;
\mathbf{A1}=\lambda\mathbf1\Longrightarrow\mathbf{A}^{-1}\mathbf1=\lambda^{-1}\mathbf1
</math>
The '''differential volume ratio''' is:
:<math>
R^\Delta_\text{gcal}(\mathbf p;\mathbf A,\mathbf c)
=\frac{\left|\operatorname{det}(\mathbf A)\right|}{|\lambda|}\prod_{i=1}^n\frac{q_i}{p_i}
</math>
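As a numerical sanity check of the inverse and of the differential volume ratio, the sketch below uses an illustrative <math>3\text{-by-}3</math> matrix whose rows each sum to <math>\lambda=2</math>, so that <math>\mathbf{A1}=\lambda\mathbf1</math>. It assumes that <math>f_\text{gcal}</math> is extended to the positive orthant by the same formula, and approximates its Jacobian by central differences. All numerical values and the helper <code>numerical_jacobian</code> are assumptions made for illustration.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def f_gcal(p, A, c):
    """Generalized calibration transform: softmax(A log(p) + c)."""
    return softmax(A @ np.log(p) + c)

def f_gcal_inv(q, A, c):
    """Inverse, expressed as f_gcal with parameters A^{-1} and -A^{-1} c."""
    A_inv = np.linalg.inv(A)
    return f_gcal(q, A_inv, -A_inv @ c)

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f at x, treating f as a map on R^n."""
    n = x.size
    J = np.empty((f(x).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return J

lam = 2.0
A = np.array([[ 2.0, 0.5, -0.5],
              [ 0.3, 1.5,  0.2],
              [-0.4, 0.4,  2.0]])  # every row sums to lam, so A @ 1 = lam * 1
c = np.array([0.3, -0.1, 0.2])
p = np.array([0.2, 0.3, 0.5])
n = p.size

q = f_gcal(p, A, c)
p_back = f_gcal_inv(q, A, c)

E = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])  # Jacobian of the embedding e
R = np.hstack([np.eye(n - 1), np.zeros((n - 1, 1))])  # Jacobian of the representation r
F = numerical_jacobian(lambda v: f_gcal(v, A, c), p)

ratio_numeric = abs(np.linalg.det(R @ F @ E))
ratio_closed = abs(np.linalg.det(A)) / abs(lam) * np.prod(q / p)
```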
If <math>f_\text{gcal}</math> is to be used as a calibration transform, further constraints could be imposed, for example that <math>\mathbf A</math> be [[positive definite matrix|positive definite]], so that <math>(\mathbf{Ax})'\mathbf x>0</math> for all <math>\mathbf x\ne\boldsymbol0</math>, which avoids direction reversals. (This is one possible generalization of <math>a>0</math> in the <math>f_\text{cal}</math> parameter.)
 
For <math>n=2</math>, with <math>a>0</math> and <math>\mathbf A</math> positive definite, <math>f_\text{cal}</math> and <math>f_\text{gcal}</math> are equivalent in the sense that in both cases, <math>\log\frac{p_1}{p_2}\mapsto\log\frac{q_1}{q_2}</math> is a straight line, the (positive) slope and offset of which are functions of the transform parameters. For <math>n>2,</math> <math>f_\text{gcal}</math> ''does'' generalize <math>f_\text{cal}</math>.
 
It must however be noted that chaining multiple <math>f_\text{gcal}</math> flow transformations does ''not'' give a further generalization, because:
:<math>
f_\text{gcal}(\cdot\,;\mathbf A_1,\mathbf c_1) \circ
f_\text{gcal}(\cdot\,;\mathbf A_2,\mathbf c_2)
= f_\text{gcal}(\cdot\,;\mathbf A_1\mathbf A_2,\mathbf c_1+\mathbf A_1\mathbf c_2)
</math>
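The composition rule can be checked numerically. In the sketch below, the helper <code>with_row_sums</code> is a hypothetical construction that shifts the diagonal of a random matrix so that every row sums to a chosen <math>\lambda</math>, which is one simple way to satisfy <math>\mathbf{A1}=\lambda\mathbf1</math>. The random seed and parameter values are arbitrary.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def f_gcal(p, A, c):
    """Generalized calibration transform: softmax(A log(p) + c)."""
    return softmax(A @ np.log(p) + c)

def with_row_sums(B, lam):
    """Shift the diagonal of B so that every row sums to lam, i.e. A @ 1 = lam * 1."""
    return B + np.diag(lam - B.sum(axis=1))

rng = np.random.default_rng(1)
n = 4
lam1, lam2 = 1.5, -0.7
A1 = with_row_sums(0.5 * rng.normal(size=(n, n)), lam1)
A2 = with_row_sums(0.5 * rng.normal(size=(n, n)), lam2)
c1, c2 = rng.normal(size=n), rng.normal(size=n)
p = softmax(rng.normal(size=n))  # an arbitrary point in the simplex interior

# Apply the transform with (A2, c2) first, then the one with (A1, c1).
lhs = f_gcal(f_gcal(p, A2, c2), A1, c1)

# A single transform with the composed parameters.
rhs = f_gcal(p, A1 @ A2, c1 + A1 @ c2)

# The composed matrix again has the all-ones vector as an eigenvector.
composed_eigen = (A1 @ A2) @ np.ones(n)
```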
In fact, the set of <math>f_\text{gcal}</math> transformations forms a [[Group (mathematics)|group]] under function composition. The set of <math>f_\text{cal}</math> transformations forms a subgroup.
 
Also see: '''Dirichlet calibration''',<ref>{{cite arXiv
| title = Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration
| author = Meelis Kull, Miquel Perelló‑Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter A. Flach
| eprint = 1910.12656
| date = 28 October 2019
| class = cs.LG
}}</ref> which generalizes <math>f_\text{gcal}</math>, by not placing any restriction on the matrix, <math>\mathbf A</math>, so that invertibility is not guaranteed. While Dirichlet calibration is trained as a discriminative model, <math>f_\text{gcal}</math> can also be trained as part of a generative calibration model.
 
===Differential volume ratio for curved manifolds===
Consider a flow, <math>\mathbf y=f(\mathbf x)</math> on a curved manifold, for example <math>\mathbb S^{n-1}</math>, which we equip with the embedding function, <math>e</math> that maps a set of <math>(n-1)</math> [[N-sphere#Spherical coordinates|angular spherical coordinates]] to <math>\mathbb S^{n-1}</math>. The Jacobian of <math>e</math> is non-constant and we have to evaluate it at both input (<math>\mathbf {E_x}</math>) and output (<math>\mathbf {E_y}</math>). The same applies to <math>r</math>, the representation function that recovers spherical coordinates from points on <math>\mathbb S^{n-1}</math>, for which we need the Jacobian at the output (<math>\mathbf{R_y}</math>). The differential volume ratio now generalizes to:
:<math>
R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{R_yF_xE_x})\right|\,\frac{\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf y'\mathbf{E_y})\right|}}{\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf x'\mathbf{E_x})\right|}}
</math>
For geometric insight, consider <math>\mathbb S^2</math>, where the spherical coordinates are co-latitude, <math>\theta\in[0,\pi]</math> and longitude, <math>\phi\in[0,2\pi)</math>. At <math>\mathbf x = e(\theta,\phi)</math>, we get <math>\sqrt{\left|\operatorname{det}(\mathbf E_\mathbf x'\mathbf{E_x})\right|}=\sin\theta</math>, which gives the radius of the circle at that latitude (compare e.g. polar circle to equator). The differential volume (surface area on the sphere) is: <math>\sin\theta\,d\theta\,d\phi</math>.
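The <math>\sin\theta</math> factor can be reproduced from the Gram determinant of the embedding Jacobian. The sketch below assumes one common convention for spherical coordinates, with co-latitude measured from the third coordinate axis. The function names and the evaluation point are illustrative.

```python
import numpy as np

def sphere_embedding(theta, phi):
    """Map co-latitude theta and longitude phi to a point on the unit 2-sphere in R^3."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def sphere_embedding_jacobian(theta, phi):
    """3-by-2 Jacobian E_x of the embedding with respect to (theta, phi)."""
    return np.array([
        [np.cos(theta) * np.cos(phi), -np.sin(theta) * np.sin(phi)],
        [np.cos(theta) * np.sin(phi),  np.sin(theta) * np.cos(phi)],
        [-np.sin(theta),               0.0],
    ])

theta, phi = 0.8, 2.1
x = sphere_embedding(theta, phi)
E_x = sphere_embedding_jacobian(theta, phi)

# Square root of the Gram determinant: local area scaling of the embedding.
gram_factor = np.sqrt(abs(np.linalg.det(E_x.T @ E_x)))
```

Here <math>\mathbf E_\mathbf x'\mathbf{E_x}</math> is diagonal with entries <math>1</math> and <math>\sin^2\theta</math>, which is why the factor reduces to <math>\sin\theta</math>.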
 
The above derivation for <math>R_f</math> is fragile in the sense that when using ''fixed'' functions <math>e,r</math>, there may be places where they are not well-defined, for example at the poles of the 2-sphere where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing to ''local'' coordinates (charts), where in the vicinities of <math>\mathbf x,\mathbf y\in\mathcal M</math>, we map from local <math>m</math>-dimensional coordinates to <math>\R^n</math> and back using the respective function pairs <math>e_\mathbf x, r_\mathbf x</math> and <math>e_\mathbf y, r_\mathbf y</math>. We continue to use the same notation for the Jacobians of these functions (<math>\mathbf{E_x}, \mathbf{E_y}, \mathbf{R_y}</math>), so that the above formula for <math>R_f</math> remains valid.
 
We ''can'', however, choose our local coordinate system in a way that simplifies the expression for <math>R_f</math> and indeed also its practical implementation.<ref name=manifold_flow/> Let <math>\pi:\mathcal P\to\R^n</math> be a smooth idempotent projection (<math>\pi\circ\pi=\pi</math>) from the ''projectible set'', <math>\mathcal P\subseteq\R^n</math>, onto the embedded manifold. For example:
* The positive orthant of <math>\R^n</math> is projected onto the '''simplex''' as: <math>\pi(\mathbf z)=\bigl(\sum_{i=1}^n z_i\bigr)^{-1}\mathbf z</math>
* Non-zero vectors in <math>\R^n</math> are projected onto the '''unitsphere''' as: <math>\pi(\mathbf z)=\bigl(\sum_{i=1}^n z^2_i\bigr)^{-\frac12}\mathbf z</math>
For every <math>\mathbf x\in\mathcal M</math>, we require of <math>\pi</math> that its <math>n\text{-by-}n</math> Jacobian, <math>\boldsymbol{\Pi_x}</math> has rank <math>m</math> (the manifold dimension), in which case <math>\boldsymbol{\Pi_x}</math> is an [[projection (linear algebra)|idempotent linear projection]] onto the local tangent space (''orthogonal'' for the unitsphere: <math>\mathbf I_n-\mathbf{xx}'</math>; ''oblique'' for the simplex: <math>\mathbf I_n-\boldsymbol{x1}'</math>). The columns of <math>\boldsymbol{\Pi_x}</math> span the <math>m</math>-dimensional tangent space at <math>\mathbf x</math>. We use the notation, <math>\mathbf{T_x}</math> for any <math>n\text{-by-}m</math> matrix with orthonormal columns (<math>\mathbf T_{\mathbf x}'\mathbf{T_x}=\mathbf I_m</math>) that span the local tangent space. Also note: <math>\boldsymbol{\Pi_x}\mathbf{T_x}=\mathbf{T_x}</math>. We can now choose our local coordinate embedding function, <math>e_\mathbf x:\R^m\to\R^n</math>:
:<math>
e_\mathbf x(\tilde\mathbf x) = \pi(\mathbf x + \mathbf{T_x}\tilde\mathbf x)\,,
\text{with Jacobian:}\,\mathbf{E_x}=\mathbf{T_x}\,\text{at}\,\tilde\mathbf x=\mathbf0.
</math>
Since the Jacobian is injective (full rank: <math>m</math>), a local (not necessarily unique) [[left inverse function|left inverse]], say <math>r^*_\mathbf x</math> with Jacobian <math>\mathbf R^*_\mathbf x</math>, exists such that <math>r^*_\mathbf x(e_\mathbf x(\tilde\mathbf x))=\tilde\mathbf x</math> and <math>\mathbf R^*_\mathbf x\mathbf{T_x}=\mathbf I_m</math>. In practice we do not need the left inverse function itself, but we ''do'' need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as, <math>r_\mathbf x:\R^n\to\R^m</math>:
:<math>
r_\mathbf x(\mathbf z) = r^*_\mathbf x(\pi(\mathbf z))\,,\text{with Jacobian:}\,
\mathbf{R_x}=\mathbf T_\mathbf x'
</math>
We can now finally plug <math>\mathbf{E_x}=\mathbf{T_x}</math> and <math>\mathbf{R_y}=\mathbf T_\mathbf y'</math> into our previous expression for <math>R_f</math>, the '''differential volume ratio''', which because of the orthonormal Jacobians, simplifies to:<ref>The tangent matrices are not unique: if <math>\mathbf T </math> has orthonormal columns and <math>\mathbf Q</math> is an [[orthogonal matrix]], then <math>\mathbf{TQ}</math> also has orthonormal columns that span the same subspace; it is easy to verify that <math>\left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|</math> is invariant to such transformations of the tangent representatives.</ref>
:<math>
R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|
</math>
which agrees with the formula from Sorrenson et al. (2023),<ref name='manifold_flow'/> which they derived via a different argument. Moreover, that paper gives a general method for constructing the tangent matrices and shows how to efficiently compute stochastic gradient approximations for the differential volume log-ratio during learning.
 
====Practical implementation====
The tangent matrix at a point <math>\mathbf x\in\mathcal M</math> is not unique. If <math>\mathbf{T_x}</math>, <math>n\text{-by-}m</math>, spans the local tangent space with orthonormal columns, then so does <math>\mathbf{T_xQ}</math>, if <math>\mathbf Q</math> is any <math>m\text{-by-}m</math> [[orthogonal matrix]] and it is easy to check that the above determinant formula is invariant to such reparametrization of the tangent space. For our running examples:
* The tangent space of the simplex, <math>\Delta^{n-1}</math>, is the hyperplane parallel to the simplex and it is the same everywhere.
* The local tangent space at <math>\mathbf x\in\mathbb S^{n-1}</math> is the hyperplane perpendicular to the radius, <math>\mathbf x</math> and it is the image (column space) of the [[Projection (linear algebra)#Orthogonal projection|orthogonal projection matrix]], <math>\mathbf I_n-\mathbf{xx}'</math>.

For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, <math>R_f</math>, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to <math>R_f</math> itself. Practical solutions include:
*Sorrenson et al. (2023)<ref name=manifold_flow/> give a solution for computationally efficient stochastic parameter gradient approximation for <math>\log R_f.</math>
*For some hand-designed flow transforms, <math>R_f</math> can be analytically derived in closed form, for example the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
*On a software platform equipped with [[linear algebra]] and [[automatic differentiation]], <math>R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|</math> can be automatically evaluated, given access to only <math>\mathbf x, f, \pi</math>.<ref>With [[PyTorch]]:
<pre>
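from torch.linalg import qr
from torch.func import jacrev

def logRf(pi, m, f, x):
    # log R_f at x; pi projects onto the m-dimensional manifold
    y = f(x)
    Fx, PI = jacrev(f)(x), jacrev(pi)  # Jacobian of f at x; Jacobian function of pi
    # orthonormal tangent bases: first m columns of Q from QR of the projection Jacobians
    Tx, Ty = [qr(PI(z)).Q[:, :m] for z in (x, y)]
    return (Ty.T @ Fx @ Tx).slogdet().logabsdet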
</pre></ref> But this is expensive for high-dimensional data, with at least <math>\mathcal O(n^3)</math> computational costs. Even then, the slow automatic solution can be invaluable as a tool for numerically verifying hand-designed closed-form solutions.
 
=== Simple spherical flows ===
In machine learning literature, various complex spherical flows formed by deep neural network architectures may be found.<ref name=manifold_flow/> In contrast, this section compiles from ''statistics'' literature the details of three very simple spherical flow transforms, with simple closed-form expressions for inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unitsphere, <math>\mathbb S^{n-1}</math>. All three flows are compositions of an invertible affine transform in <math>\R^n</math>, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:
* To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
* To derive in simple closed form the '''differential volume ratio''', <math>R_f</math>.
An interesting property of these simple spherical flows is that they do not make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form surprisingly flexible distributions.
 
==== Normalized translation flow ====
The normalized translation flow, <math>f_\text{trans}:\mathbb S^{n-1}\to\mathbb S^{n-1}</math>, with parameter <math>\mathbf c\in\R^n</math>, is given by:
:<math>
\mathbf y = f_\text{trans}(\mathbf x;\mathbf c)
=\frac{\mathbf x + \mathbf c}{\lVert\mathbf x + \mathbf c\rVert}\,,
\;\text{where}\;\lVert\mathbf c\rVert < 1
</math>
The inverse function may be derived by considering, for <math>\ell>0</math>: <math>\mathbf y=\ell^{-1}(\mathbf x+\mathbf c)</math> and then using <math>\mathbf x'\mathbf x=1</math> to get a [[quadratic equation]] to recover <math>\ell</math>, which gives:
:<math>
\mathbf x = f^{-1}_\text{trans}(\mathbf y;\mathbf c) = \ell\mathbf y - \mathbf c\,,\text{where}\;
\ell = \mathbf y'\mathbf c +\sqrt{(\mathbf y'\mathbf c)^2+1-\mathbf c'\mathbf c}
</math>
from which we see that we need <math>\lVert\mathbf c\rVert < 1</math> to keep <math>\ell</math> real and positive for all <math>\mathbf y\in\mathbb S^{n-1}</math>. The '''differential volume ratio''' is given (without derivation) by Boulerice & Ducharme (1994) as:<ref name=BDflow>
{{cite journal
|last1=Boulerice
|first1=Bernard
|last2=Ducharme
|first2=Gilles R.
|title=Decentered Directional Data
|journal=Annals of the Institute of Statistical Mathematics
|volume=46
|issue=3
|pages=573–586
|year=1994
|doi=10.1007/BF00773518
}}
</ref>
:<math>
R_\text{trans}(\mathbf x;\mathbf c) = \frac{1+\mathbf x'\mathbf c}{\lVert\mathbf x +\mathbf c\rVert^n}
</math>
This can indeed be verified analytically:
*By a laborious manipulation of <math>R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|</math>.
*By setting <math>\mathbf M=\mathbf I_n</math> in <math>R_\text{aff}(\mathbf x;\mathbf M, \mathbf c)</math>, which is given below.
Finally, it is worth noting that <math>f_\text{trans}</math> and <math>f^{-1}_\text{trans}</math> do not have the same functional form.
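As an illustration, the forward map, the inverse and the volume ratio above can be sketched in NumPy as follows. The function names and the chaining helper <code>chain_log_density</code> are assumptions made for this example. The density is obtained from the usual change of variables, <math>\log p_Y(\mathbf y)=\log p_X(\mathbf x)-\log R_f(\mathbf x)</math>. The usage part chains two flows on the circle (<math>n=2</math>) with a uniform base density and checks numerically that the resulting density integrates to approximately one.

```python
import numpy as np

def f_trans(x, c):
    """Normalized translation flow on the unit sphere; requires ||c|| < 1."""
    z = x + c
    return z / np.linalg.norm(z)

def f_trans_inv(y, c):
    """Inverse flow: x = ell*y - c, with ell the positive root of the quadratic."""
    yc = y @ c
    ell = yc + np.sqrt(yc**2 + 1.0 - c @ c)
    return ell * y - c

def log_R_trans(x, c):
    """Log differential volume ratio, log[(1 + x'c) / ||x + c||^n]."""
    n = x.size
    return np.log1p(x @ c) - n * np.log(np.linalg.norm(x + c))

def chain_log_density(y, cs, log_base):
    """Log density at y of base samples pushed through f_trans(.; c) for c in cs, in order."""
    log_det = 0.0
    for c in reversed(cs):          # undo the last flow first
        x = f_trans_inv(y, c)
        log_det += log_R_trans(x, c)
        y = x
    return log_base(y) - log_det

# Usage on the circle (n = 2), with arbitrary parameters of norm < 1.
cs = [np.array([0.5, 0.2]), np.array([-0.3, 0.6])]
log_uniform = lambda x: -np.log(2.0 * np.pi)   # uniform density on S^1

thetas = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
dens = np.array([
    np.exp(chain_log_density(np.array([np.cos(t), np.sin(t)]), cs, log_uniform))
    for t in thetas
])
total = dens.sum() * (2.0 * np.pi / thetas.size)
print(total)  # should be close to 1

# Round trip through one flow and its inverse.
x0 = np.array([0.6, 0.8])
x0_back = f_trans_inv(f_trans(x0, cs[0]), cs[0])
```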
 
==== Normalized linear flow ====
The normalized linear flow, <math>f_\text{lin}:\mathbb S^{n-1}\to\mathbb S^{n-1}</math>, where parameter <math>\mathbf M</math> is an invertible <math>n\text{-by-}n</math> matrix, is given by:
:<math>
\mathbf y = f_\text{lin}(\mathbf x;\mathbf M)
=\frac{\mathbf{Mx}}{\lVert\mathbf{Mx}\rVert}
\;\iff\;
\mathbf x = f^{-1}_\text{lin}(\mathbf y;\mathbf M)
= f_\text{lin}(\mathbf y;\mathbf M^{-1})
=\frac{\mathbf{M^{-1}y}}{\lVert\mathbf{M^{-1}y}\rVert}
</math>
The '''differential volume ratio''' is:
:<math>
R_\text{lin}(\mathbf x; \mathbf M) =
\frac{\left|\operatorname{det}\mathbf M\right|}
{\lVert\mathbf{Mx}\rVert^n}
</math>
This result can be derived indirectly via the '''Angular central Gaussian distribution (ACG)''',<ref>
{{cite journal|title=Statistical analysis for the angular central Gaussian distribution on the sphere|last1=Tyler|first1=David E|journal=Biometrika|volume=74|number=3|pages=579–589|year=1987|doi=10.2307/2336697|jstor=2336697 }}
</ref> which can be obtained by applying a normalized linear transform to either Gaussian or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius, after which the second relationship can be used to factor out the differential volume ratio. For details, see [[ACG distribution]].
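The relationship with the ACG distribution can be checked numerically. The following NumPy sketch (function names are illustrative assumptions) pushes the uniform density on the sphere through the normalized linear flow and compares the result with the standard ACG density with parameter <math>\boldsymbol\Sigma=\mathbf{MM}'</math>, namely <math>p(\mathbf y)=A_n^{-1}\left|\operatorname{det}\boldsymbol\Sigma\right|^{-1/2}(\mathbf y'\boldsymbol\Sigma^{-1}\mathbf y)^{-n/2}</math>, where <math>A_n=2\pi^{n/2}/\Gamma(n/2)</math> is the surface area of <math>\mathbb S^{n-1}</math>.

```python
import math
import numpy as np

def f_lin(x, M):
    """Normalized linear flow on the unit sphere."""
    z = M @ x
    return z / np.linalg.norm(z)

def f_lin_inv(y, M):
    """Inverse: the same flow with parameter M^{-1} (solve avoids forming the inverse)."""
    z = np.linalg.solve(M, y)
    return z / np.linalg.norm(z)

def log_R_lin(x, M):
    """Log differential volume ratio, log[|det M| / ||Mx||^n]."""
    n = x.size
    return np.linalg.slogdet(M)[1] - n * np.log(np.linalg.norm(M @ x))

def log_sphere_area(n):
    """Log surface area of the unit sphere S^{n-1}: log(2 pi^{n/2} / Gamma(n/2))."""
    return math.log(2.0) + 0.5 * n * math.log(math.pi) - math.lgamma(0.5 * n)

def log_density_flow(y, M):
    """Uniform base density on the sphere pushed through f_lin(.; M)."""
    x = f_lin_inv(y, M)
    return -log_sphere_area(y.size) - log_R_lin(x, M)

def log_density_acg(y, Sigma):
    """ACG log density with parameter Sigma."""
    n = y.size
    q = y @ np.linalg.solve(Sigma, y)
    return (-log_sphere_area(n)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * n * np.log(q))

rng = np.random.default_rng(0)
n = 3
M = rng.normal(size=(n, n)) + 3.0 * np.eye(n)  # arbitrary test matrix
x = rng.normal(size=n)
x /= np.linalg.norm(x)
y = f_lin(x, M)

lp_flow = log_density_flow(y, M)
lp_acg = log_density_acg(y, M @ M.T)
print(lp_flow, lp_acg)  # the two log densities should agree
```

Because the inverse has the same functional form, the volume ratio of the inverse flow at <math>\mathbf y</math> is the reciprocal of <math>R_\text{lin}(\mathbf x;\mathbf M)</math>.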
 
==== Normalized affine flow ====
The normalized affine flow, <math>f_\text{aff}:\mathbb S^{n-1}\to\mathbb S^{n-1}</math>, with parameters <math>\mathbf c\in\R^n</math> and <math>\mathbf M</math>, <math>n\text{-by-}n</math> invertible, is given by:
:<math>
f_\text{aff}(\mathbf x;\mathbf M, \mathbf c)
=\frac{\mathbf{Mx} + \mathbf c}{\lVert\mathbf{Mx} + \mathbf c\rVert}\,,
\;\text{where}\;\lVert\mathbf{M^{-1}c}\rVert < 1
</math>
The inverse function, derived in a similar way to the normalized translation inverse, is:
:<math>
\mathbf x = f^{-1}_\text{aff}(\mathbf y;\mathbf M,\mathbf c) = \mathbf M^{-1}(\ell\mathbf y - \mathbf c)\,,\text{where}\;
\ell = \frac{\mathbf y'\mathbf{Wc} +\sqrt{(\mathbf y'\mathbf{Wc})^2+\mathbf y'\mathbf{Wy}(1-\mathbf c'\mathbf{Wc})}}{\mathbf y'\mathbf{Wy}}
</math>
where <math>\mathbf W=(\mathbf{MM}')^{-1}</math>. The '''differential volume ratio''' is:
:<math>
R_\text{aff}(\mathbf x; \mathbf M, \mathbf c)
=R_\text{lin}(\mathbf x; \mathbf M+\mathbf c\mathbf x') =
\frac{\left|\operatorname{det}\mathbf M\right|(1+\mathbf x'\mathbf{M^{-1}c})}
{\lVert\mathbf{Mx+c}\rVert^n}
</math>
The final RHS numerator was expanded from <math>\operatorname{det}(\mathbf M + \mathbf{cx}')</math> by the [[matrix determinant lemma]]. Recalling <math>R_f(\mathbf x)=\left|\operatorname{det}(\mathbf T_\mathbf y'\mathbf{F_xT_x})\right|</math>, the equality between <math>R_\text{aff}</math> and <math>R_\text{lin}</math> holds because not only:
:<math>\mathbf x'\mathbf x=1\;\Longrightarrow\;\mathbf y = f_\text{aff}(\mathbf x; \mathbf{M,c})=f_\text{lin}(\mathbf x; \mathbf{M+cx}')
</math>
but also, by orthogonality of <math>\mathbf x</math> to the local tangent space:
:<math>
\mathbf x'\mathbf{T_x}=\boldsymbol0\;\Longrightarrow\;\mathbf F_\mathbf x^\text{aff}\mathbf{T_x} = \mathbf F_\mathbf x^\text{lin}\mathbf{T_x}
</math>
where <math>\mathbf F_\mathbf x^\text{lin}=\lVert\mathbf{Mx}+\mathbf c\rVert^{-1}(\mathbf I_n-\mathbf{yy}')(\mathbf{M+cx}')</math> is the Jacobian of <math>f_\text{lin}</math> differentiated with respect to its input, but ''not'' also with respect to its parameter.
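The identities above can be checked numerically. The following NumPy sketch (function names are illustrative assumptions) implements the normalized affine flow, its inverse and its volume ratio, and compares <math>R_\text{aff}(\mathbf x;\mathbf M,\mathbf c)</math> with <math>R_\text{lin}(\mathbf x;\mathbf M+\mathbf{cx}')</math>. The inverse is computed with <math>\mathbf u=\mathbf M^{-1}\mathbf y</math> and <math>\mathbf v=\mathbf M^{-1}\mathbf c</math>, so that <math>\mathbf y'\mathbf{Wy}=\mathbf u'\mathbf u</math>, <math>\mathbf y'\mathbf{Wc}=\mathbf u'\mathbf v</math> and <math>\mathbf c'\mathbf{Wc}=\mathbf v'\mathbf v</math>.

```python
import numpy as np

def f_aff(x, M, c):
    """Normalized affine flow on the unit sphere; requires ||M^{-1} c|| < 1."""
    z = M @ x + c
    return z / np.linalg.norm(z)

def f_aff_inv(y, M, c):
    """Inverse flow, x = M^{-1}(ell*y - c) = ell*u - v."""
    u = np.linalg.solve(M, y)   # u = M^{-1} y
    v = np.linalg.solve(M, c)   # v = M^{-1} c
    uu, uv, vv = u @ u, u @ v, v @ v
    ell = (uv + np.sqrt(uv**2 + uu * (1.0 - vv))) / uu
    return ell * u - v

def log_R_aff(x, M, c):
    """Log of |det M| (1 + x'M^{-1}c) / ||Mx + c||^n."""
    n = x.size
    v = np.linalg.solve(M, c)
    return (np.linalg.slogdet(M)[1] + np.log1p(x @ v)
            - n * np.log(np.linalg.norm(M @ x + c)))

def log_R_lin(x, A):
    """Log of |det A| / ||Ax||^n, the normalized linear flow ratio."""
    n = x.size
    return np.linalg.slogdet(A)[1] - n * np.log(np.linalg.norm(A @ x))

rng = np.random.default_rng(1)
n = 3
M = rng.normal(size=(n, n)) + 3.0 * np.eye(n)  # arbitrary test matrix
c = M @ np.array([0.3, -0.4, 0.2])             # chosen so that ||M^{-1} c|| < 1
x = rng.normal(size=n)
x /= np.linalg.norm(x)

y = f_aff(x, M, c)
x_back = f_aff_inv(y, M, c)

A = M + np.outer(c, x)                         # M + c x'
lr_aff = log_R_aff(x, M, c)
lr_lin = log_R_lin(x, A)
print(lr_aff, lr_lin)  # the two log ratios should agree
```

Setting <code>M</code> to the identity in <code>log_R_aff</code> gives the log volume ratio of the normalized translation flow.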
 
== Downsides ==