Denoising Diffusion Samplers
Authors: Francisco Vargas, Will Sussman Grathwohl, Arnaud Doucet
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present here experiments for Algorithm 1. In our implementation, fθ follows the PIS-GRAD network proposed in (Zhang & Chen, 2022): fθ(k, x) = NN1(k, x; θ) + NN2(k; θ) ∇ln π(x). Across all experiments we use a two-layer architecture with 64 hidden units each (for both networks), as in Zhang & Chen (2022), with the exception of the NICE (Dinh et al., 2014) target where we use 3 layers with 512, 256, and 64 units respectively. The final layers are initialised to 0 in order to make the path regularisation term null. We use α_k^{1/2} = α_max^{1/2} cos²((π/2)·(1 − k/K + s)/(1 + s)) with s = 0.008 as in (Nichol & Dhariwal, 2021). We found that detaching the target score stabilised optimization in both approaches without affecting the final result. We adopt this across experiments; an ablation of this feature can be seen in Appendix C.9.1. Across all tasks we compare DDS to SMC (Del Moral et al., 2006; Zhou et al., 2016), PIS (Barr et al., 2020; Vargas et al., 2021; Zhang & Chen, 2022), and Mean-Field VI (MF-VI) with a Gaussian variational distribution. We also compare DDS to AIS (Neal, 2001) and optimized variants of AIS using score matching (MCD) (Doucet et al., 2022; Geffner & Domke, 2022) for two standard Bayesian models. Finally we explore a task introduced in (Doucet et al., 2022) that uses a pre-trained normalising flow as a target. Within this setting we propose a benchmarking criterion that allows us to assess mode collapse in high dimensions and explore the benefits of incorporating inductive biases into fθ. We carefully tuned the hyper-parameters of all algorithms (e.g. step size, diffusion coefficient, and such); details can be found in Appendix C.2. Finally, training times can be found in Appendix C.5. Additional experiments for the normalizing flows are presented in Appendix C.11 and for the underdamped approach in Appendix C.12. (A hedged JAX sketch of this network parameterisation and of the cosine schedule is given below the table.) |
| Researcher Affiliation | Collaboration | Francisco Vargas¹, Will Grathwohl² & Arnaud Doucet²; ¹University of Cambridge, ²DeepMind |
| Pseudocode | Yes | Algorithm 1 DDS Training |
| Open Source Code | No | The paper contains no explicit statement about providing open-source code for the described methodology, nor a direct link to a code repository for the paper. |
| Open Datasets | Yes | Normalizing Flow Evaluation: Following Doucet et al. (2022) we train NICE (Dinh et al., 2014) on a down-sampled d = 14 × 14 variant of MNIST (LeCun & Cortes, 2010) and use the trained model as our target. ... Funnel Distribution: This 10-dimensional challenging distribution is given by γ(x1:10) = N(x1; 0, σf²) N(x2:10; 0, exp(x1) I), where σf² = 9 (Neal, 2003). ... Log Gaussian Cox process: This model arises in spatial statistics (Møller et al., 1998). ... Logistic Regression: We set x ∼ N(0, σw² I), yi ∼ Bernoulli(sigmoid(xᵀui)). This Bayesian logistic model is evaluated on two datasets, Ionosphere (d = 32) and Sonar (d = 61). (Illustrative JAX log-densities for the funnel and logistic-regression targets are sketched below the table.) |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages, counts, or references to predefined splits for reproduction. |
| Hardware Specification | Yes | We evaluate the training times of DDS and PIS on an 8-chip TPU circuit and average over 99 runs. |
| Software Dependencies | No | We compute an unbiased gradient of this objective using the reparameterization trick and the JAX software package (Bradbury et al., 2018). ... Finally we trained with Adam (Kingma & Ba, 2015)... which are β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, naming as per Kingma & Ba (2015). From the github repository of Zhang & Chen (2022) ... Optax library (Hessel et al., 2020)... |
| Experiment Setup | Yes | In our implementation, fθ follows the PIS-GRAD network proposed in (Zhang & Chen, 2022): fθ(k, x) = NN1(k, x; θ) + NN2(k; θ) ∇ln π(x). Across all experiments we use a two-layer architecture with 64 hidden units each (for both networks), as in Zhang & Chen (2022), with the exception of the NICE (Dinh et al., 2014) target where we use 3 layers with 512, 256, and 64 units respectively. The final layers are initialised to 0 in order to make the path regularisation term null. We use α_k^{1/2} = α_max^{1/2} cos²((π/2)·(1 − k/K + s)/(1 + s)) with s = 0.008 as in (Nichol & Dhariwal, 2021). ... For both PIS and DDS we created a grid with step size δ = T/K = 0.05 and values of T ∈ {3.4, 6.4, 12.8, 25.6} with corresponding numbers of steps K ∈ {64, 128, 128, 256}. ... Finally both PIS and DDS were trained with Adam (Kingma & Ba, 2015) for at most 11000 iterations... Across all experiments modulo the Funnel we use the Adam optimiser with a learning rate of 0.0001 with no learning-rate decay and 11000 training iterations; for the rest of the optimisation parameters we use the default settings as provided by the Optax library (Hessel et al., 2020), which are β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, naming as per Kingma & Ba (2015). (An Optax training-loop sketch with these settings follows the table.) |
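
The drift parameterisation quoted in the Research Type and Experiment Setup rows can be written down compactly. The following is a minimal JAX sketch of a PIS-GRAD-style network fθ(k, x) = NN1(k, x; θ) + NN2(k; θ) ∇ln π(x) with zero-initialised final layers and a detached (stop-gradient) target score. The layer widths, the scalar time feature, and the initialisation scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a PIS-GRAD-style drift network (assumptions noted above).
import jax
import jax.numpy as jnp


def init_mlp(key, sizes, zero_last=True):
    """Initialise MLP parameters; the final layer is zeroed so f_theta starts at 0."""
    params = []
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (m, n)) / jnp.sqrt(m)
        if zero_last and i == len(sizes) - 2:
            w = jnp.zeros((m, n))  # zero output layer => f_theta == 0 at initialisation
        params.append({"w": w, "b": jnp.zeros(n)})
    return params


def mlp(params, h):
    for layer in params[:-1]:
        h = jnp.tanh(h @ layer["w"] + layer["b"])
    last = params[-1]
    return h @ last["w"] + last["b"]


def make_drift(log_pi, dim, hidden=64, seed=0):
    """Return params and drift(params, k, x) = NN1(k, x) + NN2(k) * grad log pi(x)."""
    k1, k2 = jax.random.split(jax.random.PRNGKey(seed))
    params = {
        "nn1": init_mlp(k1, [dim + 1, hidden, hidden, dim]),
        "nn2": init_mlp(k2, [1, hidden, hidden, dim]),
    }

    def drift(params, k, x):
        t = jnp.atleast_1d(jnp.asarray(k, dtype=x.dtype))     # simple scalar time feature
        score = jax.lax.stop_gradient(jax.grad(log_pi)(x))    # detached target score
        return mlp(params["nn1"], jnp.concatenate([t, x])) + mlp(params["nn2"], t) * score

    return params, drift
```

With the zero-initialised output layers, fθ is identically zero at the start of training, which is what makes the path regularisation term null, as the quoted text states.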
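The cosine schedule quoted above, α_k^{1/2} = α_max^{1/2} cos²((π/2)·(1 − k/K + s)/(1 + s)) with s = 0.008 (Nichol & Dhariwal, 2021), is a one-liner. The snippet below is a sketch of that formula; α_max and K are left as free choices here since the quoted text does not fix them for every task.

```python
# Sketch of the quoted cosine schedule for alpha_k^{1/2}; alpha_max and K are placeholders.
import jax.numpy as jnp


def cosine_alpha_sqrt(k, K, alpha_max=1.0, s=0.008):
    """Return alpha_k^{1/2} for step k in {0, ..., K}."""
    phase = 0.5 * jnp.pi * (1.0 - k / K + s) / (1.0 + s)
    return jnp.sqrt(alpha_max) * jnp.cos(phase) ** 2


# Example: the full schedule for K = 128 discretisation steps.
ks = jnp.arange(129)
alphas_sqrt = cosine_alpha_sqrt(ks, K=128)
```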
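For concreteness, here are illustrative JAX log-densities for two of the targets named in the Open Datasets row: the 10-dimensional funnel γ(x1:10) = N(x1; 0, σf²) N(x2:10; 0, exp(x1) I) with σf² = 9, and the Bayesian logistic regression model x ∼ N(0, σw² I), yi ∼ Bernoulli(sigmoid(xᵀui)). The function names and the default prior variance σw² = 1 are assumptions for the sketch, not values taken from the paper.

```python
# Illustrative target log-densities (funnel and Bayesian logistic regression).
import jax
import jax.numpy as jnp


def funnel_log_density(x, sigma_f2=9.0):
    """log gamma(x_{1:10}) = log N(x1; 0, sigma_f^2) + log N(x_{2:10}; 0, exp(x1) I)."""
    x1, rest = x[0], x[1:]
    log_p1 = -0.5 * (x1 ** 2 / sigma_f2 + jnp.log(2 * jnp.pi * sigma_f2))
    var = jnp.exp(x1)
    log_rest = -0.5 * jnp.sum(rest ** 2 / var + jnp.log(2 * jnp.pi * var))
    return log_p1 + log_rest


def logistic_regression_log_density(x, U, y, sigma_w2=1.0):
    """Unnormalised log-posterior: prior x ~ N(0, sigma_w^2 I), y_i ~ Bernoulli(sigmoid(x^T u_i))."""
    log_prior = -0.5 * jnp.sum(x ** 2) / sigma_w2      # up to an additive constant
    logits = U @ x                                     # rows of U are the covariates u_i
    log_lik = jnp.sum(y * jax.nn.log_sigmoid(logits)
                      + (1 - y) * jax.nn.log_sigmoid(-logits))
    return log_prior + log_lik
```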
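The optimisation settings in the Software Dependencies and Experiment Setup rows (Adam via Optax, learning rate 10⁻⁴, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, no learning-rate decay, 11000 iterations) translate into a short training loop. The sketch below shows one way to wire this up; the loss function and parameter pytree are placeholders, not the paper's objective.

```python
# Sketch of the quoted Optax/Adam training setup; loss_fn and params are placeholders.
import jax
import optax

optimizer = optax.adam(learning_rate=1e-4, b1=0.9, b2=0.999, eps=1e-8)


def train(params, loss_fn, num_iters=11_000):
    opt_state = optimizer.init(params)

    @jax.jit
    def step(params, opt_state):
        loss, grads = jax.value_and_grad(loss_fn)(params)
        updates, opt_state = optimizer.update(grads, opt_state)
        params = optax.apply_updates(params, updates)
        return params, opt_state, loss

    for _ in range(num_iters):
        params, opt_state, loss = step(params, opt_state)
    return params
```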