Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Authors: Aaron Lou, Chenlin Meng, Stefano Ermon

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. We now empirically validate our score entropy discrete diffusion (SEDD) model on a variety of language modeling tasks. We measure both perplexity (i.e. likelihood estimation capabilities) and generation quality, finding that our method performs quite well in both aspects.
Researcher Affiliation | Collaboration | Aaron Lou (Stanford University), Chenlin Meng (Stanford University, Pika Labs), Stefano Ermon (Stanford University).
Pseudocode | Yes | The paper includes 'Algorithm 1 Score Entropy Training Loop (Multiple Dimensions)', 'Algorithm 2 Score Entropy Sampling (Unconditional)', and 'Algorithm 3 Score Entropy Sampling (Conditional)' in Appendix B. (An illustrative training-loop sketch follows the table.)
Open Source Code | Yes | We open source our code at github.com/louaaron/Score-Entropy-Discrete-Diffusion
Open Datasets | Yes | We compare on the text8 dataset, a small, character-level language modeling task. We follow Austin et al. (2021) for network hyperparameters and dataset splits... We also test SEDD on One Billion Words, a more medium-sized and real-world dataset. We follow He et al. (2022) for the tokenization, training, and model size configurations. We train on OpenWebText as the original WebText dataset has not been made available (this is typical practice and does not meaningfully affect results in practice) (Gokaslan & Cohen, 2019) and test on the LAMBADA, WikiText2, PTB, WikiText103, and One Billion Words datasets (which were all of the GPT-2 zero-shot tasks that measured perplexity). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | We follow Austin et al. (2021) for network hyperparameters and dataset splits and compare with methods that employ a similar model size. We also matched architecture hyperparameters with prior work... as well as the same data splits.
Hardware Specification | Yes | We trained on nodes of 8 A100 80GB or 16 A100 40GB GPUs, using gradient accumulation when our batch size did not fit into memory (as is the case for SEDD medium). (A gradient-accumulation sketch follows the table.)
Software Dependencies | No | The paper mentions using 'flash attention' and the 'huggingface transformers library' but does not specify their version numbers, which are required for reproducible software dependencies.
Experiment Setup | Yes | All models were trained with a batch size of 512 and a learning rate of 3 × 10⁻⁴. We clip our gradient norm to 1 and have a linear warmup schedule for the first 2000 iterations. We also use a 0.9999 EMA. (A training-configuration sketch follows the table.)
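
The Appendix B algorithms are not reproduced on this page. For orientation only, below is a minimal PyTorch-style sketch of what one training step of a score-entropy model could look like under an absorbing (masking) forward process; `model`, `noise_schedule`, `score_entropy_loss`, and `mask_id` are hypothetical placeholders rather than the authors' components, and the released repository should be treated as the reference implementation.

import torch

def perturb_absorbing(x0, sigma, mask_id):
    """Forward-process sketch: independently absorb each token into the
    MASK state with probability 1 - exp(-sigma)."""
    keep_prob = torch.exp(-sigma)[:, None]                      # shape (batch, 1)
    keep = torch.rand(x0.shape, device=x0.device) < keep_prob   # shape (batch, seq_len)
    return torch.where(keep, x0, torch.full_like(x0, mask_id))

def training_step(model, optimizer, noise_schedule, score_entropy_loss, x0, mask_id):
    """One illustrative step of a score-entropy training loop (not Algorithm 1 itself)."""
    t = torch.rand(x0.shape[0], device=x0.device)      # diffusion times in (0, 1)
    sigma = noise_schedule(t)                          # cumulative noise level (hypothetical helper)
    xt = perturb_absorbing(x0, sigma, mask_id)         # corrupt the clean tokens

    scores = model(xt, sigma)                          # network output: estimated ratios p_t(y) / p_t(x_t)
    loss = score_entropy_loss(scores, x0, xt, sigma)   # placeholder for the paper's denoising score entropy

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # clipping as reported in the Experiment Setup row
    optimizer.step()
    return loss.detach()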
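
One plausible way to obtain the evaluation corpora named in the Open Datasets row is through the Hugging Face `datasets` library. The identifiers below are assumptions about current hub naming rather than something the paper prescribes, and text8 is omitted because it has no single canonical hub entry.

from datasets import load_dataset  # Hugging Face `datasets` library

# Training corpus: OpenWebText, the open recreation of WebText.
train = load_dataset("openwebtext", split="train")

# Zero-shot perplexity evaluation sets (assumed hub identifiers).
eval_sets = {
    "LAMBADA": load_dataset("lambada", split="test"),
    "WikiText2": load_dataset("wikitext", "wikitext-2-raw-v1", split="test"),
    "PTB": load_dataset("ptb_text_only", split="test"),
    "WikiText103": load_dataset("wikitext", "wikitext-103-raw-v1", split="test"),
    "One Billion Words": load_dataset("lm1b", split="test"),
}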
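
Gradient accumulation, mentioned in the Hardware Specification row, is a standard technique rather than anything SEDD-specific. The sketch below shows the generic PyTorch pattern; the accum_steps value and the loss_fn helper are illustrative assumptions, not the authors' settings.

import torch

def train_with_accumulation(model, optimizer, dataloader, loss_fn, accum_steps=4):
    """Accumulate gradients over `accum_steps` micro-batches so the effective
    batch size is accum_steps * micro_batch_size (e.g. 4 x 128 = 512)."""
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = loss_fn(model, batch) / accum_steps   # scale so accumulated gradients average correctly
        loss.backward()                              # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()

Dividing each micro-batch loss by accum_steps keeps the accumulated gradient equal to the gradient of the mean loss over the full effective batch.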
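
The reported hyperparameters (batch size 512, learning rate 3 × 10⁻⁴, gradient-norm clipping at 1, a 2000-iteration linear warmup, and a 0.9999 EMA) map onto standard PyTorch components. The sketch below is one plausible configuration, not the authors' code; in particular, AdamW is an assumption, since the quoted setup does not name the optimizer.

import torch

def configure_and_train(model, dataloader, loss_fn, warmup_steps=2000, ema_decay=0.9999):
    """Training-configuration sketch matching the reported hyperparameters."""
    # AdamW is an assumption; the quoted text only gives the learning rate.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Linear warmup over the first `warmup_steps` iterations, then a constant rate.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
    )

    # Exponential moving average of the weights with decay `ema_decay`.
    ema = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda avg, new, n: ema_decay * avg + (1.0 - ema_decay) * new
    )

    for batch in dataloader:                     # batches of 512 sequences per the paper
        loss = loss_fn(model, batch)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradient norm to 1
        optimizer.step()
        scheduler.step()
        ema.update_parameters(model)
    return ema

At evaluation time, metrics would typically be reported with the EMA weights (ema.module) rather than the raw model parameters.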