Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Authors: Aaron Lou, Chenlin Meng, Stefano Ermon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. We now empirically validate our score entropy discrete diffusion (SEDD) model on a variety of language modeling tasks. We measure both perplexity (i.e. likelihood estimation capabilities) as well as generation quality, finding that our method performs quite well in both aspects. |
| Researcher Affiliation | Collaboration | Aaron Lou¹, Chenlin Meng¹,², Stefano Ermon¹; ¹Stanford University, ²Pika Labs. |
| Pseudocode | Yes | The paper includes 'Algorithm 1 Score Entropy Training Loop (Multiple Dimensions)', 'Algorithm 2 Score Entropy Sampling (Unconditional)', and 'Algorithm 3 Score Entropy Sampling (Conditional)' in Appendix B. |
| Open Source Code | Yes | We open source our code at github.com/louaaron/Score-Entropy-Discrete-Diffusion |
| Open Datasets | Yes | We compare on the text8 dataset, a small, character level language modeling task. We follow Austin et al. (2021) for network hyperparameters and dataset splits... We also test SEDD on One Billion Words, a more medium sized and real world dataset. We follow He et al. (2022) for the tokenization, training, and model size configurations. We train on OpenWebText as the original WebText dataset has not been made available (this is typical practice and does not meaningfully affect results in practice) (Gokaslan & Cohen, 2019) and test on the LAMBADA, WikiText2, PTB, WikiText103, and One Billion Words datasets (which were all of the GPT-2 zero-shot tasks that measured perplexity). |
| Dataset Splits | Yes | We follow Austin et al. (2021) for network hyperparameters and dataset splits and compare with methods that employ a similar model size. We also matched architecture hyperparameters with prior work... as well as the same data splits. |
| Hardware Specification | Yes | We trained on nodes of 8 A100 80GB or 16 A100 40GB GPUs, using gradient accumulation when our batch size did not fit into memory (as is the case for SEDD medium). (See the gradient-accumulation sketch after the table.) |
| Software Dependencies | No | The paper mentions using 'flash attention' and the 'huggingface transformers library' but does not specify their version numbers, which is required for reproducible software dependencies. |
| Experiment Setup | Yes | All models were trained with a batch size of 512 and a learning rate of 3 × 10⁻⁴. We clip our gradient norm to 1 and have a linear warmup schedule for the first 2000 iterations. We also use a 0.9999 EMA. (A configuration sketch based on these values follows the table.) |
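The Experiment Setup row reports the training hyperparameters but not the surrounding training loop, which is not reproduced in the report. Below is a minimal sketch, assuming PyTorch, of how the reported values (batch size 512, learning rate 3 × 10⁻⁴, gradient-norm clipping at 1, a 2000-iteration linear warmup, and a 0.9999 EMA) fit together; the tiny linear model and `placeholder_loss` are hypothetical stand-ins, not the authors' SEDD architecture or score entropy objective.

```python
import copy
import torch

model = torch.nn.Linear(16, 16)              # stand-in for the SEDD transformer
ema_model = copy.deepcopy(model)             # exponential moving average of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 2000                          # linear warmup over the first 2000 iterations
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
ema_decay = 0.9999
batch_size = 512

def placeholder_loss(net, batch):
    # Stand-in for the paper's score entropy objective.
    return net(batch).pow(2).mean()

def training_step(batch):
    optimizer.zero_grad()
    loss = placeholder_loss(model, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm to 1
    optimizer.step()
    scheduler.step()
    with torch.no_grad():                    # EMA update with decay 0.9999
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return loss.item()

# Usage: one step on a random batch of the reported size.
print(training_step(torch.randn(batch_size, 16)))
```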
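The Hardware Specification row notes that gradient accumulation was used when the effective batch of 512 did not fit into GPU memory (e.g. for SEDD medium). Here is a minimal sketch of that pattern, again assuming PyTorch; the stand-in model, loss, and the choice of 4 micro-batches are illustrative assumptions, not values reported in the paper.

```python
import torch

model = torch.nn.Linear(16, 16)                # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

accum_steps = 4                                # illustrative: 4 micro-batches of 128 sequences

def accumulated_step(full_batch):
    optimizer.zero_grad()
    for micro in full_batch.chunk(accum_steps):
        # Divide by accum_steps so the summed gradients match one full-batch step.
        loss = model(micro).pow(2).mean() / accum_steps
        loss.backward()                        # gradients accumulate across micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()                           # one optimizer update per effective batch of 512

accumulated_step(torch.randn(512, 16))
```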