Autoregressive Diffusion Models

Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically we demonstrate that ARDMs perform similarly to or better than discrete diffusion models while being more efficient in modelling steps. Performance of these methods is presented in Table 1.
Researcher Affiliation | Collaboration | Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans; Google Research; e.hoogeboom@uva.nl, {agritsenko,pooleb,bastings,salimans}@google.com, riannevdberg@gmail.com
Pseudocode | Yes | Algorithm 1 Sampling from OA-ARDMs; Algorithm 2 Optimizing OA-ARDMs; Algorithm 3 Sampling from Upscale-ARDMs; Algorithm 4 Optimizing Upscale-ARDMs (a minimal training-step sketch follows the table)
Open Source Code | Yes | Next to the given descriptions, the implementation has been open-sourced at https://github.com/google-research/google-research/tree/master/autoregressive_diffusion.
Open Datasets | Yes | Order Agnostic Modelling: To better understand how ARDMs compare to other order agnostic generative models, we study their performance on a character modelling task using the text8 dataset (Mahoney, 2011). This pattern translates to CIFAR-10 (Krizhevsky et al., 2009) where ARDMs also outperform D3PMs and degrade more gracefully under fewer steps. For audio experiments we used a subset of the SC09 dataset (Warden, 2018).
Dataset Splits | Yes | For CIFAR10 (Krizhevsky et al., 2009) we train the model using a fixed number of steps using the typical splits. For the text8 dataset (Mahoney, 2011) we train using the typical 90·10^6 / 5·10^6 / 5·10^6 splits in characters. The resulting dataset contains 31158/3643/4107 training/validation/test audio clips. (A split sketch follows the table.)
Hardware Specification | Yes | The runs take approximately 2 weeks to complete training on 8 TPUv4 devices, although good performance (≈ 2.8 bits per dimension) is already achieved after a couple of days. The runs take approximately a week to complete on 4 TPUv4 devices.
Software Dependencies | No | The paper mentions using JAX and NumPy (in the pseudocode) and TensorFlow Datasets, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | The models are trained for 3000 epochs with Adam using a learning rate of 0.0001 and beta parameters (0.9 / 0.999). The models are trained with a batch size of 128. The gradient is clipped at 100. ARDMs are optimized with Adam with a learning rate of 0.0005, which has a linear warm-up for the first 5000 steps. The additional L_CE loss was included with a factor 0.0001. The gradient is clipped at 0.25. Audio AO-ARDM and Upscale ARDM models were trained using the Adam optimizer (Kingma & Ba, 2014) with beta parameters 0.9 / 0.999 for 10^6 steps with a batch size of 256 and a linear learning rate warm-up over the first 15000 steps, followed by a constant learning rate of 10^-4. (An optimizer-configuration sketch follows the table.)
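
The Pseudocode row above lists Algorithms 1–4. As a concrete illustration of the order-agnostic objective behind Algorithm 2 (Optimizing OA-ARDMs), here is a minimal JAX sketch of one training loss evaluation. The network interface `apply_fn`, the use of an extra absorbing mask token at index `num_classes`, and all shapes are assumptions made for illustration, not the released implementation.

```python
# Minimal sketch of an OA-ARDM training loss in the spirit of Algorithm 2.
# `apply_fn` is an assumed interface mapping (params, masked input, mask,
# step) to per-position categorical logits of shape (D, num_classes).
import jax
import jax.numpy as jnp

def oa_ardm_loss(params, apply_fn, x, rng, num_classes):
    """x: int32 array of shape (D,), one flattened datapoint."""
    D = x.shape[0]
    rng_t, rng_sigma = jax.random.split(rng)

    # Sample how many dimensions count as already generated (t - 1 in the
    # paper's 1-indexed notation) and a uniformly random generation order.
    t = jax.random.randint(rng_t, (), 0, D)
    sigma = jax.random.permutation(rng_sigma, D)
    observed = sigma < t  # exactly t positions are treated as observed

    # Hide the not-yet-generated positions behind an absorbing mask token.
    masked_x = jnp.where(observed, x, num_classes)

    # Per-position log-likelihood of the true values under the model.
    logits = apply_fn(params, masked_x, observed, t)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    ll = jnp.take_along_axis(log_probs, x[:, None], axis=-1)[:, 0]

    # Average over the D - t unobserved positions (the 1/(D - t + 1) factor
    # from the paper, with t 0-indexed here), then scale by D so a single
    # sampled step is an unbiased estimate of the full order-agnostic bound.
    nll_per_dim = -jnp.sum(jnp.where(observed, 0.0, ll)) / (D - t)
    return D * nll_per_dim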
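
The Dataset Splits row quotes the standard text8 partition of 90·10^6 / 5·10^6 / 5·10^6 characters. A small sketch of that split, assuming the raw 100-million-character text8 file has already been downloaded (the local path is a placeholder):

```python
# Split the 100M-character text8 file into the standard 90M/5M/5M
# train/validation/test partitions quoted above. "text8" is a placeholder
# path to the downloaded file.
with open("text8", "rb") as f:
    data = f.read()

assert len(data) == 100_000_000  # text8 is exactly 10^8 lowercase characters

train = data[:90_000_000]
valid = data[90_000_000:95_000_000]
test = data[95_000_000:]
```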
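
The Experiment Setup row describes the text8 optimizer settings: Adam, a learning rate of 0.0005 with a linear warm-up over the first 5000 steps, and gradient clipping at 0.25. Below is a hedged reconstruction with optax; the paper does not say which optimizer library or which flavour of clipping was used, so the optax API and global-norm clipping are assumptions.

```python
# Sketch of the quoted text8 optimizer settings (assuming optax): Adam with
# the 0.9/0.999 beta parameters quoted elsewhere in the setup, a linear
# warm-up to 5e-4 over the first 5000 steps, and gradient clipping at 0.25
# (global-norm clipping is an assumption).
import optax

learning_rate = optax.linear_schedule(
    init_value=0.0, end_value=5e-4, transition_steps=5000
)
optimizer = optax.chain(
    optax.clip_by_global_norm(0.25),
    optax.adam(learning_rate=learning_rate, b1=0.9, b2=0.999),
)

# Usage: opt_state = optimizer.init(params)
#        updates, opt_state = optimizer.update(grads, opt_state, params)
#        params = optax.apply_updates(params, updates)
```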