Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Authors: Aaron Lou, Chenlin Meng, Stefano Ermon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. We now empirically validate our score entropy discrete diffusion (SEDD) model on a variety of language modeling tasks. We measure both perplexity (i.e. likelihood estimation capabilities) as well as generation quality, finding that our method performs quite well in both aspects. |
| Researcher Affiliation | Collaboration | Aaron Lou¹, Chenlin Meng¹,², Stefano Ermon¹; ¹Stanford University, ²Pika Labs. |
| Pseudocode | Yes | The paper includes 'Algorithm 1 Score Entropy Training Loop (Multiple Dimensions)', 'Algorithm 2 Score Entropy Sampling (Unconditional)', and 'Algorithm 3 Score Entropy Sampling (Conditional)' in Appendix B. |
| Open Source Code | Yes | We open source our code at github.com/louaaron/Score-Entropy-Discrete-Diffusion |
| Open Datasets | Yes | We compare on the text8 dataset, a small, character level language modeling task. We follow Austin et al. (2021) for network hyperparameters and dataset splits... We also test SEDD on One Billion Words, a more medium sized and real world dataset. We follow He et al. (2022) for the tokenization, training, and model size configurations. We train on OpenWebText as the original WebText dataset has not been made available (this is typical practice and does not meaningfully affect results in practice) (Gokaslan & Cohen, 2019) and test on the LAMBADA, WikiText2, PTB, WikiText103, and One Billion Words datasets (which were all of the GPT-2 zero-shot tasks that measured perplexity). |
| Dataset Splits | Yes | We follow Austin et al. (2021) for network hyperparameters and dataset splits and compare with methods that employ a similar model size. We also matched architecture hyperparameters with prior work... as well as the same data splits. |
| Hardware Specification | Yes | We trained on nodes of 8 A100 80GB or 16 A100 40GB GPUs, using gradient accumulation when our batch size did not fit into memory (as is the case for SEDD medium). (See the gradient-accumulation sketch after the table.) |
| Software Dependencies | No | The paper mentions using 'flash attention' and the 'huggingface transformers library' but does not specify their version numbers, which is required for reproducible software dependencies. |
| Experiment Setup | Yes | All models were trained with a batch size of 512 and a learning rate of 3 × 10⁻⁴. We clip our gradient norm to 1 and have a linear warmup schedule for the first 2000 iterations. We also use a 0.9999 EMA. (A configuration sketch based on these values follows the table.) |
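The Experiment Setup row reports the training hyperparameters but not the surrounding training loop, which is not reproduced in the report. Below is a minimal sketch, assuming PyTorch, of how the reported values (batch size 512, learning rate 3 × 10⁻⁴, gradient-norm clipping at 1, a 2000-iteration linear warmup, and a 0.9999 EMA) fit together; the tiny linear model and `placeholder_loss` are hypothetical stand-ins, not the authors' SEDD architecture or score entropy objective.

```python
import copy
import torch

model = torch.nn.Linear(16, 16)              # stand-in for the SEDD transformer
ema_model = copy.deepcopy(model)             # exponential moving average of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 2000                          # linear warmup over the first 2000 iterations
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
ema_decay = 0.9999
batch_size = 512

def placeholder_loss(net, batch):
    # Stand-in for the paper's score entropy objective.
    return net(batch).pow(2).mean()

def training_step(batch):
    optimizer.zero_grad()
    loss = placeholder_loss(model, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm to 1
    optimizer.step()
    scheduler.step()
    with torch.no_grad():                    # EMA update with decay 0.9999
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return loss.item()

# Usage: one step on a random batch of the reported size.
print(training_step(torch.randn(batch_size, 16)))
```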
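The Hardware Specification row notes that gradient accumulation was used when the effective batch of 512 did not fit into GPU memory (e.g. for SEDD medium). Here is a minimal sketch of that pattern, again assuming PyTorch; the stand-in model, loss, and the choice of 4 micro-batches are illustrative assumptions, not values reported in the paper.

```python
import torch

model = torch.nn.Linear(16, 16)                # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

accum_steps = 4                                # illustrative: 4 micro-batches of 128 sequences

def accumulated_step(full_batch):
    optimizer.zero_grad()
    for micro in full_batch.chunk(accum_steps):
        # Divide by accum_steps so the summed gradients match one full-batch step.
        loss = model(micro).pow(2).mean() / accum_steps
        loss.backward()                        # gradients accumulate across micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()                           # one optimizer update per effective batch of 512

accumulated_step(torch.randn(512, 16))
```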