Autoregressive Diffusion Models

Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically we demonstrate that ARDMs perform similarly to or better than discrete diffusion models while being more efficient in modelling steps. Performance of these methods is presented in Table 1.
Researcher Affiliation | Collaboration | Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans; Google Research; e.hoogeboom@uva.nl, {agritsenko,pooleb,bastings,salimans}@google.com, riannevdberg@gmail.com
Pseudocode | Yes | Algorithm 1 Sampling from OA-ARDMs; Algorithm 2 Optimizing OA-ARDMs; Algorithm 3 Sampling from Upscale-ARDMs; Algorithm 4 Optimizing Upscale-ARDMs (a minimal training-step sketch follows the table)
Open Source Code | Yes | Next to the given descriptions, the implementation has been open-sourced at https://github.com/google-research/google-research/tree/master/autoregressive_diffusion.
Open Datasets | Yes | Order Agnostic Modelling: To better understand how ARDMs compare to other order agnostic generative models, we study their performance on a character modelling task using the text8 dataset (Mahoney, 2011). This pattern translates to CIFAR-10 (Krizhevsky et al., 2009) where ARDMs also outperform D3PMs and degrade more gracefully under fewer steps. For audio experiments we used a subset of the SC09 dataset (Warden, 2018).
Dataset Splits | Yes | For CIFAR10 (Krizhevsky et al., 2009) we train the model using a fixed number of steps using the typical splits. For the text8 dataset (Mahoney, 2011) we train using the typical 90·10^6 / 5·10^6 / 5·10^6 splits in characters. The resulting dataset contains 31158/3643/4107 training/validation/test audio clips. (A split sketch follows the table.)
Hardware Specification | Yes | The runs take approximately 2 weeks to complete training on 8 TPUv4 devices, although good performance (≈ 2.8 bits per dimension) is already achieved after a couple of days. The runs take approximately a week to complete on 4 TPUv4 devices.
Software Dependencies | No | The paper mentions using JAX and NumPy (in the pseudocode) and TensorFlow Datasets, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | The models are trained for 3000 epochs with Adam using a learning rate of 0.0001 and beta parameters (0.9 / 0.999). The models are trained with a batch size of 128. The gradient is clipped at 100. ARDMs are optimized with Adam with a learning rate of 0.0005, which has a linear warm-up for the first 5000 steps. The additional L_CE loss was included with a factor 0.0001. The gradient is clipped at 0.25. Audio AO-ARDM and Upscale ARDM models were trained using the Adam optimizer (Kingma & Ba, 2014) with beta parameters 0.9 / 0.999 for 10^6 steps with a batch size of 256 and a linear learning rate warm-up over the first 15000 steps, followed by a constant learning rate of 10^-4. (An optimizer-configuration sketch follows the table.)
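
The Pseudocode row above lists Algorithms 1–4. As a concrete illustration of the order-agnostic objective behind Algorithm 2 (Optimizing OA-ARDMs), here is a minimal JAX sketch of one training loss evaluation. The network interface `apply_fn`, the use of an extra absorbing mask token at index `num_classes`, and all shapes are assumptions made for illustration, not the released implementation.

```python
# Minimal sketch of an OA-ARDM training loss in the spirit of Algorithm 2.
# `apply_fn` is an assumed interface mapping (params, masked input, mask,
# step) to per-position categorical logits of shape (D, num_classes).
import jax
import jax.numpy as jnp

def oa_ardm_loss(params, apply_fn, x, rng, num_classes):
    """x: int32 array of shape (D,), one flattened datapoint."""
    D = x.shape[0]
    rng_t, rng_sigma = jax.random.split(rng)

    # Sample how many dimensions count as already generated (t - 1 in the
    # paper's 1-indexed notation) and a uniformly random generation order.
    t = jax.random.randint(rng_t, (), 0, D)
    sigma = jax.random.permutation(rng_sigma, D)
    observed = sigma < t  # exactly t positions are treated as observed

    # Hide the not-yet-generated positions behind an absorbing mask token.
    masked_x = jnp.where(observed, x, num_classes)

    # Per-position log-likelihood of the true values under the model.
    logits = apply_fn(params, masked_x, observed, t)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    ll = jnp.take_along_axis(log_probs, x[:, None], axis=-1)[:, 0]

    # Average over the D - t unobserved positions (the 1/(D - t + 1) factor
    # from the paper, with t 0-indexed here), then scale by D so a single
    # sampled step is an unbiased estimate of the full order-agnostic bound.
    nll_per_dim = -jnp.sum(jnp.where(observed, 0.0, ll)) / (D - t)
    return D * nll_per_dim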
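
The Dataset Splits row quotes the standard text8 partition of 90·10^6 / 5·10^6 / 5·10^6 characters. A small sketch of that split, assuming the raw 100-million-character text8 file has already been downloaded (the local path is a placeholder):

```python
# Split the 100M-character text8 file into the standard 90M/5M/5M
# train/validation/test partitions quoted above. "text8" is a placeholder
# path to the downloaded file.
with open("text8", "rb") as f:
    data = f.read()

assert len(data) == 100_000_000  # text8 is exactly 10^8 lowercase characters

train = data[:90_000_000]
valid = data[90_000_000:95_000_000]
test = data[95_000_000:]
```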
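
The Experiment Setup row describes the text8 optimizer settings: Adam, a learning rate of 0.0005 with a linear warm-up over the first 5000 steps, and gradient clipping at 0.25. Below is a hedged reconstruction with optax; the paper does not say which optimizer library or which flavour of clipping was used, so the optax API and global-norm clipping are assumptions.

```python
# Sketch of the quoted text8 optimizer settings (assuming optax): Adam with
# the 0.9/0.999 beta parameters quoted elsewhere in the setup, a linear
# warm-up to 5e-4 over the first 5000 steps, and gradient clipping at 0.25
# (global-norm clipping is an assumption).
import optax

learning_rate = optax.linear_schedule(
    init_value=0.0, end_value=5e-4, transition_steps=5000
)
optimizer = optax.chain(
    optax.clip_by_global_norm(0.25),
    optax.adam(learning_rate=learning_rate, b1=0.9, b2=0.999),
)

# Usage: opt_state = optimizer.init(params)
#        updates, opt_state = optimizer.update(grads, opt_state, params)
#        params = optax.apply_updates(params, updates)
```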