Fast and unified path gradient estimators for normalizing flows
Authors: Lorenz Vaitl, Ludwig Winkler, Lorenz Richter, Pan Kessel
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically establish its superior performance and reduced variance for several natural sciences applications. (Section 5, Numerical Experiments) In this section, we compare our fast path gradients with the conventional approaches for several normalizing flow architectures, both using forward and reverse KL optimization. |
| Researcher Affiliation | Collaboration | (1) Machine Learning Group, TU Berlin; (2) Zuse Institute Berlin; (3) dida Datenschmiede GmbH; (4) Prescient Design, Genentech, Roche |
| Pseudocode | Yes | (Appendix D, Algorithms) In this section we state the different algorithms used in our work. As discussed in Section 4, due to the duality of the KL divergence, we can employ the algorithms for both the forward and the reverse KL. The different algorithms treated in this paper are: 1. The novel fast path gradient algorithm, shown in Algorithm 1. 2. As a baseline, the method proposed in Vaitl et al. (2022a), shown in Algorithm 2. 3. As a further baseline, the same algorithm amended to the GDReG estimator (Bauer & Mnih, 2021), shown in Algorithm 3. *(A generic path-gradient sketch is given below the table.)* |
| Open Source Code | Yes | Code for reproducing the experiments for GMM and U(1) at github.com/lenz3000/unified-path-gradients. |
| Open Datasets | Yes | Gaussian Mixture Model. As a tractable multimodal example, we consider a Gaussian mixture model in R^d with σ² = 0.5, i.e. we choose the energy function... We draw 10,000 samples from the Gaussian mixture model (GMM) for the forward KL training, thus mimicking a finite yet large sample set. *(An illustrative GMM snippet is given below the table.)* |
| Dataset Splits | No | The paper mentions 10,000 samples for training the GMM and 10,000 test samples, but does not explicitly state a separate validation split or how it's handled (e.g., using a portion of the training set for validation, or cross-validation). |
| Hardware Specification | Yes | Table 2: Factor of runtime increase (mean and standard deviation) in comparison to the standard gradient estimator, i.e., runtime of the path gradient divided by the runtime of the standard gradient, on an A100-80GB GPU. The upper set of experiments covers the explicitly invertible flows, applied to ϕ⁴ theory as treated in the experiments. The lower set covers implicitly invertible flows applied to U(1) theory. |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) but does not provide version numbers for software libraries such as PyTorch, TensorFlow, or Python itself, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | The results in the experiments in Figures 1 and 9 are obtained after 10,000 optimization steps, with a learning rate of 0.00001 with the Adam optimizer (Kingma & Ba, 2015) and a batch size of 4,000. The target distribution p is identical to the setup in Section 5, and the forward ESS and NLL are evaluated on 10,000 test samples. ϕ⁴ Field Theory: For our flow architecture we use a slightly modified NICE (Dinh et al., 2014) architecture, called Z2Nice (Nicoli et al., 2020), which is equivariant with respect to the Z2 symmetry of the ϕ⁴ action in (28). We use a lattice of extent 16 × 8, a learning rate of 0.0005, a batch size of 8,000, AltFC coupling, and 8 coupling blocks with 4 hidden layers each. A learning rate decay with patience of 3,000 epochs is applied. We used global scaling, Tanh activation, and a hidden width of 1,000. As base density we chose a normal distribution q0 = N(0, 1). Gradient clipping with norm 1 is applied. Just like in Nicoli et al. (2023), training is done on 50 million samples. Optimization is performed for 48h on a single A100 each, which leads to up to 1.5 million steps for the standard gradient estimators and 1.1 million steps with the fast path algorithm. *(A hypothetical training-loop skeleton is given below the table.)* |
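For orientation on the pseudocode row above: the following is a minimal PyTorch-style sketch of the generic path-gradient (sticking-the-landing) idea that the listed algorithms build on, not the paper's fast algorithm (Algorithm 1) itself. The flow interface (`flow(z)` returning `(x, log_det)` and `flow.inverse(x)` returning `(z, log_det_inv)`) and the detached-copy trick are assumptions made for illustration.

```python
import copy
import torch

def reverse_kl_path_gradient_loss(flow, base_dist, target_log_prob, batch_size):
    """Surrogate loss whose gradient is the path (sticking-the-landing) gradient
    of the reverse KL divergence KL(q_theta || p).

    Assumed interface: flow(z) returns (x, log_det) for x = T_theta(z);
    flow.inverse(x) returns (z, log_det_inv); base_dist is the base density q0;
    target_log_prob(x) is log p(x) up to an additive constant.
    """
    z = base_dist.sample((batch_size,))
    x, _ = flow(z)  # pathwise sample x = T_theta(z); gradients reach theta through x

    # Evaluate log q_theta(x) with the flow parameters detached, so that gradients
    # reach theta only through the sampling path x, not through the density itself.
    frozen_flow = copy.deepcopy(flow)
    for p in frozen_flow.parameters():
        p.requires_grad_(False)
    z_back, log_det_inv = frozen_flow.inverse(x)
    log_q = base_dist.log_prob(z_back) + log_det_inv

    return (log_q - target_log_prob(x)).mean()
```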
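For the GMM row: a small, illustrative snippet of an equal-weight Gaussian mixture energy with σ² = 0.5 and a 10,000-sample training set, as the quote describes. The number of modes, their locations, and the dimension d = 2 are assumptions; the quote only fixes σ² and the sample count.

```python
import torch

def gmm_energy(x: torch.Tensor, means: torch.Tensor, sigma2: float = 0.5) -> torch.Tensor:
    """Negative log-density (up to an additive constant) of an equal-weight
    Gaussian mixture with shared isotropic variance sigma^2."""
    sq_dists = torch.cdist(x, means).pow(2)                      # (batch, K)
    log_p = torch.logsumexp(-sq_dists / (2.0 * sigma2), dim=-1)  # unnormalized log p(x)
    return -log_p                                                # energy S(x)

# A finite forward-KL training set of 10,000 samples, mirroring the quoted setup
# (mode locations below are purely illustrative).
means = torch.tensor([[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])
idx = torch.randint(len(means), (10_000,))
train_samples = means[idx] + (0.5 ** 0.5) * torch.randn(10_000, 2)
```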
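For the experiment-setup row: a hypothetical training-loop skeleton that wires up the quoted ϕ⁴ hyperparameters (Adam with learning rate 0.0005, batch size 8,000, learning-rate decay on plateau with patience 3,000, gradient clipping at norm 1). The model and objective are stand-ins, not the paper's Z2Nice flow or its path-gradient loss.

```python
import torch

# Placeholder model and objective; the paper uses a Z2-equivariant NICE flow and
# a (path-)gradient estimator of the KL divergence instead.
model = torch.nn.Linear(16 * 8, 16 * 8)      # stand-in for the coupling-flow model
def surrogate_loss(batch_size: int) -> torch.Tensor:
    z = torch.randn(batch_size, 16 * 8)
    return model(z).pow(2).mean()            # stand-in objective

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Patience is counted in scheduler steps, i.e. however often scheduler.step() is called.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3000)

for step in range(1_000):                    # the paper trains for up to ~1.5M steps
    loss = surrogate_loss(batch_size=8_000)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step(loss.detach())
```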