Score-based Continuous-time Discrete Diffusion Models

Authors: Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, Hanjun Dai

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an empirical evaluation of the proposed diffusion approach on synthetic data, CIFAR10 dataset and the monophonic music dataset. The primary goal is to compare the effectiveness of the proposed categorical ratio matching with the alternative parameterizations presented in Section 5.
Researcher Affiliation | Collaboration | Haoran Sun (Georgia Tech, hsun349@gatech.edu); Lijun Yu (Carnegie Mellon University, lijun@cmu.edu); Bo Dai (Google Research and Georgia Tech, bodai@google.com); Dale Schuurmans (Google Research and University of Alberta, schuurmans@google.com); Hanjun Dai (Google Research, hadai@google.com)
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the methodology described, nor does it include a direct link to a code repository.
Open Datasets | Yes | We present an empirical evaluation of the proposed diffusion approach on synthetic data, CIFAR10 dataset and the monophonic music dataset. This benchmark is originally from the Lakh pianoroll dataset (Raffel, 2016; Dong et al., 2018).
Dataset Splits | Yes | In the end there are 6,000 music sequences for training and 973 sequences for evaluation.
Hardware Specification | Yes | Experiments are conducted on machines equipped with TPU-v4 chips. We train the model with 4x4x4 TPU-v4 chips, with batch size of 128.
Software Dependencies | No | The paper mentions using the "Adam optimizer (β1 = 0, β2 = 0.99) (Kingma & Ba (2014))" and references "Mask GIT (Chang et al. (2022))". However, it does not provide specific version numbers for general software components such as Python, PyTorch/TensorFlow, or other libraries.
Experiment Setup | Yes | The backbone of the neural network is the same as BERT-base, which consists of 12 layers of Transformers, where each layer has 12 attention heads, embedding size of 768 and hidden layer size of 3072 for MLPs. We use a constant uniform rate of 0.007 for the forward process. The learning rate is warmed up from 0 to 1e-4 during the first 3% steps of training, and then decays to 0 in a linear schedule. The final evaluation is done on the exponential moving average of the model parameters, with the decay rate of 0.999. Each Transformer component has 6 layers with embedding size of 256, 8 attention heads and hidden dimension of 2048.
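The 4x4x4 TPU-v4 topology quoted under Hardware Specification corresponds to 64 chips. The sketch below pictures it as a 3-D device mesh using JAX purely as an illustrative assumption; the paper does not state its framework, its sharding scheme, or any of these helper names.

    import jax
    from jax.experimental import mesh_utils
    from jax.sharding import Mesh

    # Assumption: JAX on a 4x4x4 TPU-v4 slice (64 chips) arranged as a 3-D device mesh.
    devices = mesh_utils.create_device_mesh((4, 4, 4))
    mesh = Mesh(devices, axis_names=("x", "y", "z"))

    GLOBAL_BATCH_SIZE = 128                                   # "with batch size of 128"
    per_chip_batch = GLOBAL_BATCH_SIZE // jax.device_count()  # 2 examples per chip under plain data parallelism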
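The optimization details quoted under Software Dependencies and Experiment Setup (Adam with β1 = 0 and β2 = 0.99, a warmup from 0 to 1e-4 over the first 3% of steps followed by linear decay to 0, and a parameter EMA with decay 0.999) can be read as the minimal sketch below. The use of optax is an assumption, as is the total step count; neither is specified in the paper.

    import jax
    import optax

    TOTAL_STEPS = 1_000_000                  # assumption: the paper does not state the step count
    PEAK_LR = 1e-4                           # "warmed up from 0 to 1e-4"
    WARMUP_STEPS = int(0.03 * TOTAL_STEPS)   # "during the first 3% steps of training"

    # Piecewise-linear learning rate: 0 -> 1e-4 over the warmup, then linear decay back to 0.
    lr_schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, PEAK_LR, transition_steps=WARMUP_STEPS),
            optax.linear_schedule(PEAK_LR, 0.0, transition_steps=TOTAL_STEPS - WARMUP_STEPS),
        ],
        boundaries=[WARMUP_STEPS],
    )

    # Adam with the quoted betas (beta1 = 0, beta2 = 0.99).
    optimizer = optax.adam(learning_rate=lr_schedule, b1=0.0, b2=0.99)

    # Exponential moving average of the parameters, used for the final evaluation.
    EMA_DECAY = 0.999

    def update_ema(ema_params, params, decay=EMA_DECAY):
        # Blend the running EMA copy of the parameters toward the current parameters.
        return jax.tree_util.tree_map(
            lambda e, p: decay * e + (1.0 - decay) * p, ema_params, params)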
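The Experiment Setup row quotes two distinct backbone configurations: a BERT-base-sized Transformer (12 layers, 12 heads, embedding 768, MLP hidden size 3072) and a smaller one (6 layers, 8 heads, embedding 256, hidden size 2048). A hypothetical config object makes the two explicit side by side; the class and field names are illustrative and not taken from the paper, and the quote does not say which experiment uses the smaller model.

    from dataclasses import dataclass

    @dataclass
    class TransformerConfig:
        # Illustrative field names; only the numbers below come from the quoted setup.
        num_layers: int
        num_heads: int
        embed_dim: int
        mlp_hidden_dim: int

    # BERT-base-sized backbone from the quoted setup.
    bert_base_backbone = TransformerConfig(num_layers=12, num_heads=12,
                                           embed_dim=768, mlp_hidden_dim=3072)

    # Smaller Transformer component, also quoted in the setup.
    small_backbone = TransformerConfig(num_layers=6, num_heads=8,
                                       embed_dim=256, mlp_hidden_dim=2048)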