Autoregressive Diffusion Models
Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we demonstrate that ARDMs perform similarly to or better than discrete diffusion models while being more efficient in modelling steps. Performance of these methods is presented in Table 1. |
| Researcher Affiliation | Collaboration | Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans (Google Research). e.hoogeboom@uva.nl, {agritsenko,pooleb,bastings,salimans}@google.com, riannevdberg@gmail.com |
| Pseudocode | Yes | Algorithm 1: Sampling from OA-ARDMs; Algorithm 2: Optimizing OA-ARDMs; Algorithm 3: Sampling from Upscale-ARDMs; Algorithm 4: Optimizing Upscale-ARDMs. A minimal JAX sketch of the Algorithm 2 objective is given below the table. |
| Open Source Code | Yes | Next to given descriptions, the implementation has been open-sourced at https://github.com/google-research/google-research/tree/master/autoregressive_diffusion. |
| Open Datasets | Yes | Order Agnostic Modelling To better understand how ARDMs compare to other order agnostic generative models, we study their performance on a character modelling task using the text8 dataset (Mahoney, 2011). This pattern translates to CIFAR-10 (Krizhevsky et al., 2009) where ARDMs also outperform D3PMs and degrade more gracefully under fewer steps. For audio experiments we used a subset of the SC09 dataset (Warden, 2018). |
| Dataset Splits | Yes | For CIFAR-10 (Krizhevsky et al., 2009) we train the model using a fixed number of steps using the typical splits. For the text8 dataset (Mahoney, 2011) we train using the typical 90·10^6/5·10^6/5·10^6 splits in characters. The resulting dataset contains 31158/3643/4107 training/validation/test audio clips. |
| Hardware Specification | Yes | The runs take approximately 2 weeks to complete training on 8 TPUv4 devices, although good performance (≈ 2.8 bits per dimension) is already achieved after a couple of days. The runs take approximately a week to complete on 4 TPUv4 devices. |
| Software Dependencies | No | The paper mentions using JAX and NumPy (in pseudocode context) and TensorFlow Datasets, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The models are trained for 3000 epochs with Adam using a learning rate of 0.0001 and beta parameters (0.9 / 0.999). The models are trained with a batch size of 128. The gradient is clipped at 100. ARDMs are optimized with Adam with a learning rate of 0.0005 which has a linear warm-up for the first 5000 steps. The additional L_CE loss was included with a factor 0.0001. The gradient is clipped at 0.25. Audio AO-ARDM and Upscale ARDM models were trained using the Adam optimizer (Kingma & Ba, 2014) with beta parameters 0.9 / 0.999 for 10^6 steps with a batch size of 256 and a linear learning rate warm-up over the first 15000 steps followed by a constant learning rate of 10^-4. A sketch of one of these optimizer configurations is given below the table. |
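
The pseudocode row above references Algorithm 2 (Optimizing OA-ARDMs). The following is a minimal JAX sketch of that order-agnostic training objective, assuming a hypothetical network `apply_fn(params, masked_x, mask, t)` that returns per-position categorical logits; the function names and the zero-masking of unknown positions are illustrative assumptions, not details taken from the released code.

```python
# A minimal sketch of Algorithm 2 (Optimizing OA-ARDMs).
# `apply_fn` is a hypothetical stand-in for the paper's network f.
import jax
import jax.numpy as jnp


def oa_ardm_loss(params, apply_fn, x, rng, num_classes):
    """Single-example OA-ARDM training loss (negative ELBO estimate)."""
    D = x.shape[0]                                    # number of dimensions
    rng_t, rng_sigma = jax.random.split(rng)

    # Sample a step t ~ U(1, ..., D) and a random generation order sigma.
    t = jax.random.randint(rng_t, (), 1, D + 1)
    sigma = jax.random.permutation(rng_sigma, D)

    # Positions with sigma < t are treated as already generated (observed).
    mask = (sigma < t).astype(x.dtype)
    masked_x = x * mask                               # unknown positions zeroed out (assumption)

    # Predict a categorical distribution for every position.
    logits = apply_fn(params, masked_x, mask, t)      # shape (D, num_classes)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    ll = jnp.take_along_axis(log_probs, x[:, None], axis=-1)[:, 0]

    # Only unknown positions contribute; the term is reweighted by
    # 1 / (D - t + 1) and scaled by D, as in the OA-ARDM objective.
    ll_unknown = jnp.sum((1.0 - mask) * ll)
    elbo_t = ll_unknown / (D - t + 1)
    return -D * elbo_t
```

In practice this per-example loss would be vmapped over a batch and averaged before taking a gradient step.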
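
The experiment-setup row quotes Adam with a 0.0005 learning rate, a linear warm-up over the first 5000 steps, and gradient clipping at 0.25. Below is a minimal sketch of that configuration with optax; the use of optax and the interpretation of clipping as global-norm clipping are assumptions, since the paper only lists the hyperparameter values.

```python
# Sketch of the quoted optimizer settings: Adam, learning rate 5e-4 with a
# linear warm-up over the first 5000 steps, gradients clipped at 0.25.
# optax and global-norm clipping are assumptions; the paper states only the values.
import optax

schedule = optax.linear_schedule(
    init_value=0.0,
    end_value=5e-4,
    transition_steps=5_000,  # linear warm-up over the first 5000 steps
)

optimizer = optax.chain(
    optax.clip_by_global_norm(0.25),                  # "the gradient is clipped at 0.25"
    optax.adam(learning_rate=schedule, b1=0.9, b2=0.999),
)

# Usage: opt_state = optimizer.init(params)
#        updates, opt_state = optimizer.update(grads, opt_state)
```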