Simple and Effective Masked Diffusion Language Models
Authors: Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, Volodymyr Kuleshov
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. |
| Researcher Affiliation | Academia | Subham Sekhar Sahoo, Cornell Tech, NYC, USA (ssahoo@cs.cornell.edu); Marianne Arriola, Cornell Tech, NYC, USA (ma2238@cornell.edu); Yair Schiff, Cornell Tech, NYC, USA (yzs2@cornell.edu); Aaron Gokaslan, Cornell Tech, NYC, USA (akg87@cs.cornell.edu); Edgar Marroquin, Cornell Tech, NYC, USA (emm392@cornell.edu); Justin T Chiu, Cornell Tech, NYC, USA (jtc257@cornell.edu); Alexander Rush, Cornell Tech, NYC, USA (ar459@cornell.edu); Volodymyr Kuleshov, Cornell Tech, NYC, USA (kuleshov@cornell.edu) |
| Pseudocode | Yes | Algorithm 1 (Training MDLM): 1: repeat; 2: x^{1:L} ∼ q(x) (sample a sentence); 3: t ∼ U[0,1] (sample a time step); 4: z_t^ℓ ∼ Cat(z_t^ℓ; α_t x^ℓ + (1−α_t) m) for 1 ≤ ℓ ≤ L (mask each token x^ℓ independently to obtain the latent z_t^{1:L}); 5: take a gradient descent step on L_NELBO = E_q ∫ (α'_t / (1−α_t)) Σ_ℓ log⟨x_θ^ℓ(z_t^{1:L}, t), x^ℓ⟩ dt; 6: until converged. (A minimal PyTorch sketch of this training loop is given after the table.) |
| Open Source Code | Yes | We provide the code¹, along with a blog post and video tutorial², on the project page: https://s-sahoo.com/mdlm (¹ code: https://github.com/kuleshov-group/mdlm) |
| Open Datasets | Yes | For language modeling likelihood evaluation, we conduct experiments on two datasets: The One Billion Words Dataset (LM1B; [8]) and Open Web Text (OWT; [18]). |
| Dataset Splits | Yes | Since Open Web Text does not have a validation split, we leave the last 100k docs as validation. (A sketch of reproducing this split is given after the table.) |
| Hardware Specification | Yes | We conduct all experiments on 8x 3090s, 8x A6000s, 8x A100s, or 8x H100s. The largest models on Open Web Text take 2 weeks to train on 8x A100s, while the LM1B models take only 2 days on the same hardware. |
| Software Dependencies | No | The paper mentions 'bert-base-uncased tokenizer' and 'GPT2 tokenizer [45]' but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | We use 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128 when applicable. Word embeddings are not tied between the input and output. We use the AdamW optimizer with a batch size of 512 and a learning rate warmed up from 0 to 3e-4 over 2,500 steps, then held constant for 1M, 5M, or 10M steps on One Billion Words and for 1M steps on Open Web Text. We use a dropout rate of 0.1. (The sketch after the table collects these hyper-parameters in code.) |
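
As a companion to the Algorithm 1 row above, here is a minimal PyTorch sketch of one MDLM training step. It assumes a log-linear schedule α_t = 1 − t (so α'_t/(1−α_t) = −1/t) and a hypothetical denoiser `model(z_t, t)` that returns per-token logits; these names and the schedule choice are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of one MDLM training step (Algorithm 1), NOT the authors' code.
# Assumed: `model(z_t, t)` returns per-token logits of shape (batch, L, vocab),
# `mask_id` is the [MASK] token id, and the noise schedule is log-linear.
import torch

def alpha(t):
    # Log-linear schedule: alpha_t = 1 - t, hence alpha'_t / (1 - alpha_t) = -1 / t.
    return 1.0 - t

def mdlm_training_step(model, optimizer, x, mask_id):
    """One gradient step on the continuous-time NELBO for a batch of token ids x (batch, L)."""
    batch, L = x.shape
    t = torch.rand(batch, 1, device=x.device)                  # t ~ U[0, 1]
    keep = torch.rand(batch, L, device=x.device) < alpha(t)    # keep a token with prob. alpha_t
    z_t = torch.where(keep, x, torch.full_like(x, mask_id))    # mask the remaining tokens

    logits = model(z_t, t.squeeze(-1))                         # (batch, L, vocab)
    log_p = torch.log_softmax(logits, dim=-1)
    log_p_true = log_p.gather(-1, x.unsqueeze(-1)).squeeze(-1)  # log <x_theta, x> per token

    # NELBO integrand: -(alpha'_t / (1 - alpha_t)) * sum_l log <x_theta, x^l>,
    # which for this schedule is (1 / t) * (-log p); only masked positions
    # contribute under the carry-over / zero-masking-probability parameterization.
    weight = 1.0 / t
    loss = (weight * (~keep).float() * (-log_p_true)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```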
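
For the dataset-splits row, one plausible way to reproduce the held-out validation set (the last 100k Open Web Text documents) with the Hugging Face `datasets` library is sketched below; the dataset identifier `Skylion007/openwebtext` and the use of `datasets` at all are assumptions about tooling, not details taken from the paper.

```python
# Hedged sketch: reserve the last 100k Open Web Text documents for validation.
# The dataset id and library choice are assumptions, not the authors' pipeline.
from datasets import load_dataset

owt = load_dataset("Skylion007/openwebtext", split="train")
n_valid = 100_000

train_split = owt.select(range(len(owt) - n_valid))
valid_split = owt.select(range(len(owt) - n_valid, len(owt)))
print(f"train: {len(train_split)} docs, valid: {len(valid_split)} docs")
```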
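
Finally, the experiment-setup row can be collected into a short configuration sketch. The backbone below is a plain `nn.TransformerEncoder` stand-in rather than the paper's diffusion transformer, and the vocabulary size assumes the GPT-2 tokenizer; only the dimensions (12 layers, hidden size 768, 12 heads, dropout 0.1), the untied embeddings, and the AdamW settings (batch size 512, 2,500-step warmup to 3e-4, then constant) come from the table.

```python
# Hedged configuration sketch of the reported hyper-parameters. The backbone is
# a generic TransformerEncoder stand-in, not the paper's denoiser architecture.
import torch
from torch import nn

vocab_size = 50_257   # assumption: GPT-2 tokenizer vocabulary (used for OWT)
batch_size = 512      # from the setup row; the data loader itself is out of scope here

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=4 * 768, dropout=0.1, batch_first=True,
)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=12)

# Untied input and output embeddings, as stated in the setup row.
tok_embed = nn.Embedding(vocab_size, 768)
lm_head = nn.Linear(768, vocab_size)

params = (
    list(backbone.parameters()) + list(tok_embed.parameters()) + list(lm_head.parameters())
)
optimizer = torch.optim.AdamW(params, lr=3e-4)

def lr_lambda(step, warmup=2_500):
    # Linear warmup to the base rate over the first 2,500 steps, then hold constant.
    return min(1.0, (step + 1) / warmup)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```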