On the Universality of Volume-Preserving and Coupling-Based Normalizing Flows

Authors: Felix Draxler, Stefan Wahl, Christoph Schnoerr, Ullrich Koethe

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In summary, we contribute: We show for the first time that volume-preserving flows are not universal, derive what distribution they converge to instead, and provide simple fixes for their shortcomings in Section 4. We show that whenever the target distribution is not perfectly learned, there is an affine coupling block that reduces the loss (Section 5.2). We use this result to give a new universality proof for coupling-based normalizing flows that is not volume-preserving, considers the full support of the distribution, and is not ill-conditioned in Section 5.3. Our results validate insights previously observed only empirically: affine coupling blocks are an effective foundation for normalizing flows, and volume-preserving flows have limited expressive power. We also show that the most recent distributional universality proof for affine coupling-based normalizing flows by Koehler et al. (2021) constructs such a volume-preserving flow in Section 5.1. Combined with their easy implementation and fast training and inference, our work theoretically grounds choosing coupling blocks for practical applications with normalizing flows. We remove spurious constructions present in previous proofs and use a simple principle instead: train a flow layer by layer (a schematic sketch of this principle is given after the table). Using volume-preserving flows may have negatively affected existing work. We show what distribution p̃(x) they approximate instead of the true target p(x) and propose how universality can be recovered by learning the actual latent distribution after training. In an experiment on a toy dataset for Figure 1, we demonstrate that a coupling flow constructed layer by layer as in Equation (28) learns the target distribution.
Researcher Affiliation | Academia | Felix Draxler, Stefan Wahl, Christoph Schnörr, Ullrich Köthe (Heidelberg University, Germany). Correspondence to: Felix Draxler <felix.draxler@iwr.uni-heidelberg.de>.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the methodology described.
Open Datasets | No | We construct a data distribution on a circle as a Gaussian mixture of M Gaussians with means m_i = (r cos φ_i, r sin φ_i), where the angles φ_i = 2πi/M, i = 0, …, M−1, are equally spaced, and σ_i = 0.3. The target distribution is a two-dimensional Gaussian mixture model with two modes. The two modes have the same relative weight but different covariance matrices (Σ_1 = 0.2·I, Σ_2 = 0.1·I) and means (m_1 = [−0.5, −0.5], m_2 = [0.5, 0.5]). (A sampling sketch for both toy distributions is given after the table.)
Dataset Splits | No | No explicit training/test/validation dataset splits were specified in the paper.
Hardware Specification | No | No specific hardware used to run the experiments (e.g., CPU or GPU models) was mentioned in the paper.
Software Dependencies | No | We base our code on PyTorch (Paszke et al., 2019), NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007) for plotting, and Pandas (McKinney, 2010; The pandas development team, 2020) for data evaluation.
Experiment Setup | Yes | We choose N = 226, B = 64, M = 20, α = 0.5, N_Q = 10. The resulting flow has 64 × 2 × 100 = 12,800 learnable parameters. The normalizing flow with a constant Jacobian determinant consists of 15 GIN coupling blocks as introduced in Sorrenson et al. (2019). The two subnetworks used to compute the parameters of the affine couplings are fully connected neural networks with two hidden layers and a hidden dimensionality of 128; ReLU activations are used. The weights of the linear layers of the subnetworks are initialized with the PyTorch implementation of the Xavier initialization (Glorot & Bengio, 2010), and the weights and biases of the final layer of each subnetwork are set to zero. The networks are trained using Adam (Kingma & Ba, 2017) with PyTorch's default settings and an initial learning rate of 1 × 10⁻³, which is reduced by a factor of ten after 5000, 10000, and 15000 training iterations. In total, training ran for 25000 iterations. In each iteration, a batch of size 128 was drawn from the target distribution to compute the negative log-likelihood objective. (A minimal PyTorch sketch of this setup is given below.)
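
The two toy distributions quoted in the Open Datasets row can be reproduced in a few lines of NumPy. The following is a minimal sketch based only on the parameters stated in the excerpt; the circle radius r, equal mixture weights, and the reconstructed sign of the first mode's mean are assumptions, and the function names are ours, not the authors'.

```python
import numpy as np

def sample_circle_gmm(n, M=20, r=1.0, sigma=0.3, rng=None):
    """Gaussian mixture with M equally weighted isotropic modes on a circle of radius r.
    Means: m_i = (r cos phi_i, r sin phi_i) with phi_i = 2*pi*i/M; std sigma = 0.3.
    The radius r is not stated in the excerpt and is a placeholder here."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(M, size=n)                      # pick a mode per sample
    phi = 2 * np.pi * idx / M
    means = r * np.stack([np.cos(phi), np.sin(phi)], axis=1)
    return means + sigma * rng.standard_normal((n, 2))

def sample_two_mode_gmm(n, rng=None):
    """Two equally weighted 2-D Gaussians with covariances 0.2*I and 0.1*I and
    means [-0.5, -0.5] and [0.5, 0.5] (signs of m_1 reconstructed from the excerpt)."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.array([[-0.5, -0.5], [0.5, 0.5]])
    stds = np.sqrt(np.array([0.2, 0.1]))               # isotropic std per mode
    idx = rng.integers(2, size=n)
    return means[idx] + stds[idx, None] * rng.standard_normal((n, 2))
```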
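
For the Experiment Setup row, a minimal PyTorch sketch of the described volume-preserving baseline might look as follows: 15 GIN-style coupling blocks whose subnetworks have two hidden layers of width 128 with ReLU activations, Xavier-initialized weights, zero-initialized final layers, and Adam with an initial learning rate of 1e-3 decayed by a factor of ten after 5000, 10000 and 15000 of 25000 iterations on batches of 128 samples. This illustrates the stated configuration and is not the authors' code; the 1/1 dimension split, the uniform Xavier variant, and the standard-normal latent are assumptions.

```python
import math
import torch
import torch.nn as nn

class GINCoupling(nn.Module):
    """Volume-preserving (GIN-style) affine coupling: the log-scales are re-centred
    so they sum to zero, giving a unit Jacobian determinant."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
        # final layer initialized to zero so every block starts as the identity
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = s - s.mean(dim=1, keepdim=True)   # GIN constraint: log-scales sum to zero
        # with a 1/1 split in two dimensions this fixes the single scale to 1,
        # so each block reduces to a conditional shift (still volume-preserving)
        y2 = x2 * torch.exp(s) + t
        return torch.cat([y2, x1], dim=1)     # swap halves so all dims get transformed

flow = nn.Sequential(*[GINCoupling(dim=2) for _ in range(15)])
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[5000, 10000, 15000], gamma=0.1)

means = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
stds = torch.tensor([0.2, 0.1]).sqrt()

for step in range(25000):
    comp = torch.randint(0, 2, (128,))                       # batch from the 2-mode GMM
    x = means[comp] + stds[comp, None] * torch.randn(128, 2)
    z = flow(x)
    # unit Jacobian determinant: the NLL is just the negative standard-normal log-density
    nll = (0.5 * (z ** 2).sum(dim=1) + 0.5 * x.shape[1] * math.log(2 * math.pi)).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()
    sched.step()
```

Because every block has a unit Jacobian determinant, the log-determinant term of the change-of-variables formula vanishes, which is exactly the restriction on volume-preserving flows that the paper analyzes.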
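
Finally, the "train a flow layer by layer" principle from the contributions (used for the Figure 1 experiment via Equation (28)) can be illustrated by a greedy loop that appends one ordinary, non-volume-preserving affine coupling block at a time and trains only the newest block to further reduce the negative log-likelihood. This is a conceptual sketch of that principle, not the paper's analytic construction; the block definition, hyperparameters, and freezing scheme are assumptions.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Plain (non-volume-preserving) affine coupling block returning (y, log|det J|)."""
    def __init__(self, dim=2, hidden=100):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )
        nn.init.zeros_(self.net[-1].weight)   # start as the identity map
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(s) + t
        return torch.cat([y2, x1], dim=1), s.sum(dim=1)

def train_layer_by_layer(sample_data, n_blocks=8, steps_per_block=2000, dim=2):
    """Greedily append coupling blocks; earlier blocks stay frozen while the new
    block is trained to further reduce the NLL of the composed flow."""
    blocks = []
    for _ in range(n_blocks):
        new_block = AffineCoupling(dim)
        opt = torch.optim.Adam(new_block.parameters(), lr=1e-3)
        for _ in range(steps_per_block):
            x = torch.as_tensor(sample_data(128), dtype=torch.float32)
            z, logdet = x, torch.zeros(x.shape[0])
            with torch.no_grad():             # previously trained blocks are fixed
                for b in blocks:
                    z, ld = b(z)
                    logdet = logdet + ld
            z, ld = new_block(z)
            logdet = logdet + ld
            nll = (0.5 * (z ** 2).sum(dim=1)
                   + 0.5 * dim * math.log(2 * math.pi) - logdet).mean()
            opt.zero_grad()
            nll.backward()
            opt.step()
        blocks.append(new_block)
    return blocks

# usage with the NumPy sampler from the first sketch:
# layer_blocks = train_layer_by_layer(sample_two_mode_gmm)
```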