== Training method ==
Flow-based models are generally trained by [[Maximum likelihood estimation|maximum likelihood]].<ref>{{Cite journal |last=Kobyzev |first=Ivan |last2=Prince |first2=Simon J.D. |last3=Brubaker |first3=Marcus A. |date=November 2021 |title=Normalizing Flows: An Introduction and Review of Current Methods |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=43 |issue=11}}</ref> In pseudocode, the training procedure is:
* INPUT. dataset <math>x_{1:n}</math>, normalizing flow model <math>f_\theta(\cdot), p_0 </math>.
* SOLVE. <math>\max_\theta \sum_j \ln p_\theta(x_j)</math> by gradient descent, where <math>p_\theta(x_j)</math> is computed from <math>p_0</math> and <math>f_\theta</math> via the change-of-variables formula (see the sketch below).
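A minimal PyTorch-style sketch of this training loop, assuming a hypothetical <code>flow</code> module that exposes the inverse map <math>f_\theta^{-1}</math> together with its log-Jacobian determinant, and a base distribution <code>base_dist</code> (the prior <math>p_0</math>) with a <code>log_prob</code> method:

<syntaxhighlight lang="python">
import torch

def train_flow(flow, base_dist, data_loader, epochs=100, lr=1e-3):
    """Maximum-likelihood training of a normalizing flow (sketch).

    Assumes flow.inverse(x) returns (z, log_det) with z = f_theta^{-1}(x) and
    log_det = log|det J_{f_theta^{-1}}(x)|, and base_dist.log_prob(z) = log p_0(z).
    """
    optimizer = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for x in data_loader:                 # minibatches of shape (batch, n)
            z, log_det = flow.inverse(x)
            # Change of variables: log p_theta(x) = log p_0(z) + log|det J_{f^{-1}}(x)|
            log_likelihood = base_dist.log_prob(z) + log_det
            loss = -log_likelihood.mean()     # average negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return flow
</syntaxhighlight>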
=== Planar Flow ===
The earliest example.<ref name=":0" /> Fix some activation function <math>h</math>, and let <math>\theta = (u, w, b)</math> with the appropriate dimensions, then<math display="block">x = f_\theta(z) = z + u h(\langle w, z \rangle + b)</math>The inverse <math>f_\theta^{-1}</math> has no closed-form solution in general.
The Jacobian is <math>|\det (I + h'(\langle w, z \rangle + b) uw^T)| = |1 + h'(\langle w, z \rangle + b) \langle u, w\rangle|</math>.
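For illustration, a single planar-flow layer with <math>h = \tanh</math> can be sketched in NumPy as follows (the parameters <code>u</code>, <code>w</code>, <code>b</code> would normally be learned):

<syntaxhighlight lang="python">
import numpy as np

def planar_flow(z, u, w, b):
    """Planar flow x = z + u * tanh(<w, z> + b) for a batch of points.

    z: array of shape (batch, n); u, w: arrays of shape (n,); b: scalar.
    Returns the transformed points and log|det Jacobian| for each point.
    """
    a = z @ w + b                                       # <w, z> + b, shape (batch,)
    x = z + np.outer(np.tanh(a), u)                     # z + u h(<w, z> + b)
    h_prime = 1.0 - np.tanh(a) ** 2                     # h'(<w, z> + b) for h = tanh
    log_det = np.log(np.abs(1.0 + h_prime * (u @ w)))   # |1 + h'(.) <u, w>|
    return x, log_det
</syntaxhighlight>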
=== Nonlinear Independent Components Estimation (NICE) ===
Split <math>z</math> into two halves, <math>z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}</math>. Then the flow is<math display="block">x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix}
z_1 \\ z_2
\end{bmatrix} + \begin{bmatrix}
0 \\ m_\theta(z_1)
\end{bmatrix}</math>where <math>m_\theta</math> is any neural network with weights <math>\theta</math>.
The inverse <math>f_\theta^{-1}</math> is simply <math>z_1 = x_1, z_2 = x_2 - m_\theta(x_1)</math>, and the Jacobian determinant is 1; that is, the flow is volume-preserving.
When <math>n=1</math>, this is seen as a curvy shearing along the <math>x_2</math> direction.
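A minimal sketch of one NICE coupling layer, with an arbitrary function <code>m</code> standing in for the neural network <math>m_\theta</math>:

<syntaxhighlight lang="python">
def nice_forward(z1, z2, m):
    """Additive coupling: x1 = z1, x2 = z2 + m(z1); the log-Jacobian-determinant is 0."""
    return z1, z2 + m(z1)

def nice_inverse(x1, x2, m):
    """Exact inverse: z1 = x1, z2 = x2 - m(x1)."""
    return x1, x2 - m(x1)
</syntaxhighlight>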
=== Real Non-Volume Preserving (Real NVP) ===
The Real NVP model generalizes NICE by adding a scaling:<math display="block">x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix}
z_1 \\ e^{s_\theta(z_1)} \odot z_2
\end{bmatrix} + \begin{bmatrix}
0 \\ m_\theta(z_1)
\end{bmatrix}</math>where <math>s_\theta, m_\theta</math> are neural networks with weights <math>\theta</math>.
Its inverse is <math>z_1 = x_1, z_2 = e^{-s_\theta (x_1)}\odot (x_2 - m_\theta (x_1))</math>, and its Jacobian is <math>\prod^n_{i=1} e^{s_\theta(z_1)_i}</math>. The NICE model is recovered by setting <math>s_\theta = 0</math>.
Since the Real NVP map leaves the first half of the vector <math>x</math> unchanged, a permutation <math>(x_1, x_2) \mapsto (x_2, x_1)</math> is usually added after every Real NVP layer, so that all components are eventually transformed.
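A NumPy sketch of one Real NVP coupling layer followed by the half-swapping permutation, with <code>s</code> and <code>m</code> standing in for the networks <math>s_\theta</math> and <math>m_\theta</math>:

<syntaxhighlight lang="python">
import numpy as np

def real_nvp_layer(z1, z2, s, m):
    """Affine coupling: x1 = z1, x2 = exp(s(z1)) * z2 + m(z1).

    Returns the two halves swapped, as is usual between layers, together with
    the log-Jacobian-determinant sum_i s(z1)_i.
    """
    x1 = z1
    x2 = np.exp(s(z1)) * z2 + m(z1)
    log_det = np.sum(s(z1))
    return x2, x1, log_det                  # permutation (x1, x2) -> (x2, x1)

def real_nvp_layer_inverse(y1, y2, s, m):
    """Undo the permutation, then invert: z2 = exp(-s(x1)) * (x2 - m(x1))."""
    x2, x1 = y1, y2                         # undo the swap
    z1 = x1
    z2 = np.exp(-s(x1)) * (x2 - m(x1))
    return z1, z2, -np.sum(s(x1))
</syntaxhighlight>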
=== Generative Flow (Glow) ===
In the generative flow model,<ref name="glow" /> each layer has three parts, composed in sequence:
* channel-wise affine transform<math display="block">y_{cij} = s_c(x_{cij} + b_c)</math>with Jacobian <math>\prod_c s_c^{HW}</math>.
* invertible 1x1 convolution<math display="block">z_{cij} = \sum_{c'} K_{cc'} y_{c'ij}</math>with Jacobian <math>\det(K)^{HW}</math>. Here <math>K</math> is any invertible matrix.
* Real NVP, with Jacobian as described in Real NVP.
The idea of using the invertible 1x1 convolution is to mix all channels in a general, learned way, instead of merely swapping the first and second halves, as in Real NVP.
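For illustration, the invertible 1x1 convolution and its log-Jacobian-determinant can be sketched in NumPy on a single image tensor of shape (channels, height, width):

<syntaxhighlight lang="python">
import numpy as np

def invertible_1x1_conv(x, K):
    """Apply z[c, i, j] = sum_{c'} K[c, c'] * x[c', i, j] at every pixel.

    x: array of shape (C, H, W); K: invertible (C, C) matrix.
    Returns z and log|det Jacobian| = H * W * log|det K|.
    """
    _, H, W = x.shape
    z = np.einsum('cd,dij->cij', K, x)                 # mix channels at each pixel
    log_det = H * W * np.log(np.abs(np.linalg.det(K)))
    return z, log_det

def invertible_1x1_conv_inverse(z, K):
    """Invert by mixing the channels with K^{-1}."""
    return np.einsum('cd,dij->cij', np.linalg.inv(K), z)
</syntaxhighlight>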
=== Masked autoregressive flow (MAF) ===
An autoregressive model of a distribution on <math>\R^n</math> is defined as the following stochastic process:<ref>{{Cite journal |last=Papamakarios |first=George |last2=Pavlakou |first2=Theo |last3=Murray |first3=Iain |date=2017 |title=Masked Autoregressive Flow for Density Estimation |url=https://proceedings.neurips.cc/paper/2017/hash/6c1da886822c67822bcf3679d04369fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}</ref>
<math display="block">\begin{align}
x_1 \sim& N(\mu_1, \sigma_1^2) \\
x_2 \sim& N(\mu_2(x_1), \sigma_2(x_1)^2)\\
&\cdots \\
x_n \sim& N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2)
\end{align}</math>where <math>\mu_i</math> and <math>\sigma_i > 0</math> are functions of the previous entries, typically implemented as neural networks.

The masked autoregressive flow expresses this process as a deterministic, invertible map of a latent vector <math>z</math>:<math display="block">\begin{align}
x_1 =& \mu_1 + \sigma_1 z_1 \\
x_2 =& \mu_2(x_1) + \sigma_2(x_1) z_2\\
&\cdots \\
x_n =& \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n\\
\end{align}</math>The autoregressive model is recovered by setting <math>z \sim N(0, I_{n})</math>.
The forward mapping <math>z \mapsto x</math> is slow, because it is sequential, but the backward mapping <math>x \mapsto z</math> is fast, because it can be computed in parallel.
The Jacobian matrix is lower triangular, so the Jacobian determinant is <math>\sigma_1 \sigma_2(x_1)\cdots \sigma_n(x_{1:n-1})</math>.
Reversing the two maps <math>f_\theta</math> and <math>f_\theta^{-1}</math> of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.<ref>{{Cite journal |last=Kingma |first=Durk P |last2=Salimans |first2=Tim |last3=Jozefowicz |first3=Rafal |last4=Chen |first4=Xi |last5=Sutskever |first5=Ilya |last6=Welling |first6=Max |date=2016 |title=Improved Variational Inference with Inverse Autoregressive Flow |url=https://proceedings.neurips.cc/paper/2016/hash/ddeebdeefdb7e7e7a697e1c3e3d8ef54-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=29}}</ref>
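A NumPy sketch contrasting the two directions of MAF, with hypothetical functions <code>mu(i, prefix)</code> and <code>sigma(i, prefix)</code> standing in for the autoregressive networks <math>\mu_i(x_{1:i-1})</math> and <math>\sigma_i(x_{1:i-1})</math>:

<syntaxhighlight lang="python">
import numpy as np

def maf_sample(z, mu, sigma):
    """Forward map z -> x: sequential, hence slow (each x_i needs the previous x's)."""
    n = len(z)
    x = np.zeros(n)
    for i in range(n):
        x[i] = mu(i, x[:i]) + sigma(i, x[:i]) * z[i]
    return x

def maf_density_pass(x, mu, sigma):
    """Backward map x -> z: every coordinate uses only the observed x, so it can
    be computed in one parallel pass. Also returns log|det Jacobian| of x -> z."""
    n = len(x)
    z = np.array([(x[i] - mu(i, x[:i])) / sigma(i, x[:i]) for i in range(n)])
    log_det = -sum(np.log(sigma(i, x[:i])) for i in range(n))
    return z, log_det
</syntaxhighlight>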
=== Continuous Normalizing Flow (CNF) ===
Since the trace depends only on the diagonal of the Jacobian <math>\partial_{z_t} f</math>, this allows "free-form" Jacobian.<ref>{{Cite journal |last=Grathwohl |first=Will |last2=Chen |first2=Ricky T. Q. |last3=Bettencourt |first3=Jesse |last4=Sutskever |first4=Ilya |last5=Duvenaud |first5=David |date=2018-10-22 |title=FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models |url=http://arxiv.org/abs/1810.01367 |journal=arXiv:1810.01367 [cs, stat]}}</ref> Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that the Jacobian can be evaluated efficiently.
The trace can be estimated by "Hutchinson's trick":<ref name="Finlay 3154–3164">{{Cite journal |last=Finlay |first=Chris |last2=Jacobsen |first2=Joern-Henrik |last3=Nurbekyan |first3=Levon |last4=Oberman |first4=Adam |date=2020-11-21 |title=How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization |url=https://proceedings.mlr.press/v119/finlay20a.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=3154–3164}}</ref><ref>{{Cite journal |last=Hutchinson |first=M.F. |date=January 1989 |title=A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines |journal=Communications in Statistics - Simulation and Computation |volume=18 |issue=3 |pages=1059–1076}}</ref> for any matrix <math>W \in \R^{n\times n}</math> and any random vector <math>\varepsilon \in \R^n</math> with <math>E[\varepsilon\varepsilon^T] = I</math>, we have <math>E[\varepsilon^T W \varepsilon] = \operatorname{tr}(W)</math>, so the trace can be approximated by averaging <math>\varepsilon^T (\partial_{z_t} f) \varepsilon</math> over a few random probes.
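A NumPy sketch of the estimator, which needs only matrix–vector products <math>v \mapsto Wv</math> (in a CNF, these are vector–Jacobian products obtained by automatic differentiation):

<syntaxhighlight lang="python">
import numpy as np

def hutchinson_trace(matvec, n, num_samples=100, seed=0):
    """Estimate tr(W) from matrix-vector products v -> W v alone.

    Uses E[eps^T W eps] = tr(W) for random eps with E[eps eps^T] = I.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(n)          # Gaussian probe, E[eps eps^T] = I
        total += eps @ matvec(eps)            # eps^T W eps
    return total / num_samples

# Example: the true trace of this matrix is 15.
W = np.diag(np.arange(1.0, 6.0))
print(hutchinson_trace(lambda v: W @ v, n=5))   # approximately 15
</syntaxhighlight>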
When <math>f</math> is implemented as a neural network, [[neural ODE]] methods<ref>{{cite arXiv | eprint=1806.07366| last1=Chen| first1=Ricky T. Q.| last2=Rubanova| first2=Yulia| last3=Bettencourt| first3=Jesse| last4=Duvenaud| first4=David| title=Neural Ordinary Differential Equations| year=2018| class=cs.LG}}</ref> would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.
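As a toy sketch (not the neural-ODE machinery itself), one can jointly integrate the state and the log-density using the instantaneous change-of-variables formula <math>\tfrac{d}{dt}\ln p_t(x(t)) = -\operatorname{tr}(\partial_x f)</math>, here with a hand-coded one-dimensional field <code>f</code> standing in for the neural network:

<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x):
    """A toy scalar velocity field; in a CNF this would be a neural network f_theta."""
    return np.tanh(x) * (1.0 + t)

def df_dx(t, x):
    """Its derivative in x, i.e. the 1x1 Jacobian (whose trace is itself)."""
    return (1.0 - np.tanh(x) ** 2) * (1.0 + t)

def cnf_push_forward(z0, log_p0, t_span=(0.0, 1.0)):
    """Integrate dx/dt = f(x, t) and d(log p)/dt = -tr(df/dx) jointly."""
    def ode(t, state):
        x, _ = state
        return [f(t, x), -df_dx(t, x)]
    sol = solve_ivp(ode, t_span, [z0, log_p0])
    x_final, log_p_final = sol.y[:, -1]
    return x_final, log_p_final

# Push a sample z0 from the base density p_0 = N(0, 1) through the flow.
z0 = 0.5
log_p0 = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)
print(cnf_push_forward(z0, log_p0))
</syntaxhighlight>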
There are two main deficiencies of CNF. First, a continuous flow must be a [[homeomorphism]], and therefore preserves orientation and [[ambient isotopy]]: for example, it is impossible to flip a left hand into a right hand by a continuous deformation of space, impossible to [[Sphere eversion|turn a sphere inside out]], and impossible to undo a knot. Second, the learned flow <math>f</math> might be ill-behaved, due to degeneracy: there are an infinite number of possible <math>f</math> that all solve the same problem.
By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".<ref>{{Cite journal |last=Dupont |first=Emilien |last2=Doucet |first2=Arnaud |last3=Teh |first3=Yee Whye |date=2019 |title=Augmented Neural ODEs |url=https://proceedings.neurips.cc/paper/2019/hash/21be9a4bd4f81549a9d1d241981cec3c-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32}}</ref>
Any homeomorphism of <math>\R^n</math> can be approximated by a neural ODE operating on <math>\R^{2n+1}</math>, as can be proven by combining the [[Whitney embedding theorem]] for manifolds and the [[universal approximation theorem]] for neural networks.<ref>{{Cite journal |last=Zhang |first=Han |last2=Gao |first2=Xi |last3=Unterman |first3=Jacob |last4=Arodz |first4=Tom |date=2019-07-30 |title=Approximation Capabilities of Neural ODEs and Invertible Residual Networks |url=https://arxiv.org/abs/1907.12998v2 |language=en |doi=10.48550/arXiv.1907.12998}}</ref>
To regularize the flow <math>f</math>, one can impose regularization losses on <math>\nabla_z f(z, t)</math>, such as a penalty on its Frobenius norm together with a kinetic-energy penalty on <math>\|f(z, t)\|^2</math>, which encourages the learned flow to follow short, nearly straight trajectories that are cheaper to integrate.<ref name="Finlay 3154–3164" />
== Applications ==