Simple and Effective Masked Diffusion Language Models

Authors: Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, Volodymyr Kuleshov

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state of the art among diffusion models and approaches AR perplexity.
Researcher Affiliation | Academia | All authors are affiliated with Cornell Tech, NYC, USA: Subham Sekhar Sahoo (ssahoo@cs.cornell.edu), Marianne Arriola (ma2238@cornell.edu), Yair Schiff (yzs2@cornell.edu), Aaron Gokaslan (akg87@cs.cornell.edu), Edgar Marroquin (emm392@cornell.edu), Justin T. Chiu (jtc257@cornell.edu), Alexander Rush (ar459@cornell.edu), Volodymyr Kuleshov (kuleshov@cornell.edu).
Pseudocode | Yes | Algorithm 1, Training MDLM (a hedged PyTorch sketch of one training step is given after the table):
  1: repeat
  2:   x^{1:L} ∼ q(x)   (sample a sentence)
  3:   t ∼ U[0, 1]   (sample a time step)
  4:   z_t^ℓ ∼ Cat(z_t^ℓ; α_t x^ℓ + (1 − α_t) m) for all 1 ≤ ℓ ≤ L   (mask each token x^ℓ independently to obtain the latent z_t^{1:L})
  5:   take a gradient descent step on L_NELBO = E_q [ ∫_0^1 (α_t' / (1 − α_t)) Σ_ℓ log ⟨x_θ^ℓ(z_t^{1:L}, t), x^ℓ⟩ dt ]
  6: until converged
Open Source Code | Yes | We provide the code (https://github.com/kuleshov-group/mdlm), along with a blog post and video tutorial, on the project page: https://s-sahoo.com/mdlm
Open Datasets | Yes | For language modeling likelihood evaluation, we conduct experiments on two datasets: the One Billion Words Dataset (LM1B; [8]) and Open Web Text (OWT; [18]).
Dataset Splits | Yes | Since Open Web Text does not have a validation split, we leave the last 100k docs as validation (a sketch of this split follows the table).
Hardware Specification | Yes | We conduct all experiments on 8x 3090s, 8x A6000s, 8x A100s, or 8x H100s. The largest models on Open Web Text take 2 weeks to train on 8x A100s, while the LM1B models take only 2 days to train on the same hardware.
Software Dependencies | No | The paper mentions the 'bert-base-uncased tokenizer' and the 'GPT2 tokenizer [45]' but does not specify their version numbers or other software dependencies with versions (a tokenizer-loading sketch follows the table).
Experiment Setup | Yes | We use 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128 when applicable. Word embeddings are not tied between the input and output. We use the AdamW optimizer with a batch size of 512 and a constant learning rate warmup from 0 to a learning rate of 3e-4 over 2,500 steps. We use a constant learning rate for 1M, 5M, or 10M steps on One Billion Words, and 1M steps on Open Web Text. We use a dropout rate of 0.1 (an optimizer/schedule sketch follows the table).
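
The training loop in Algorithm 1 maps directly onto a short routine. The following is a minimal PyTorch sketch of one MDLM training step, assuming a linear noise schedule α_t = 1 − t; the names (denoiser, mask_id) and the schedule are illustrative assumptions, not the authors' implementation.

```python
import torch

def mdlm_training_step(denoiser, x, mask_id, optimizer):
    """One training step in the spirit of Algorithm 1 (sketch, not the authors' code).

    denoiser: assumed to map (z_t, t) -> logits over the vocabulary, shape (B, L, V)
    x:        LongTensor of clean token ids, shape (B, L)
    mask_id:  id of the mask token m
    """
    B, L = x.shape
    t = torch.rand(B, 1, device=x.device)           # t ~ U[0, 1]
    alpha_t = 1.0 - t                               # assumed schedule: alpha_t = 1 - t
    d_alpha_t = -torch.ones_like(t)                 # its derivative alpha_t' = -1

    # z_t^l ~ Cat(z_t^l; alpha_t x^l + (1 - alpha_t) m): mask each token independently
    keep = torch.rand(B, L, device=x.device) < alpha_t
    z_t = torch.where(keep, x, torch.full_like(x, mask_id))

    # log <x_theta^l(z_t, t), x^l>: log-probability of each clean token under the denoiser
    logits = denoiser(z_t, t)
    log_p_x = torch.log_softmax(logits, dim=-1).gather(-1, x.unsqueeze(-1)).squeeze(-1)

    # With carry-over unmasking, unmasked positions contribute zero loss,
    # so only the masked positions are summed here.
    log_p_x = log_p_x * (~keep).float()

    # Monte Carlo estimate of E_q [ alpha_t'/(1 - alpha_t) * sum_l log <x_theta^l, x^l> ]
    weight = d_alpha_t / (1.0 - alpha_t).clamp(min=1e-6)
    loss = (weight * log_p_x).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As in the algorithm, a single time step t is drawn per sentence rather than per token, and the loss is a one-sample Monte Carlo estimate of the continuous-time NELBO.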
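The Open Web Text validation split described above ("last 100k docs") can be reproduced in a few lines. The sketch below assumes the HuggingFace datasets library and the openwebtext hub dataset, neither of which is specified in the excerpt.

```python
from datasets import load_dataset

# Assumption: OWT is loaded from the HuggingFace hub, where it ships a single "train" split.
owt = load_dataset("openwebtext", split="train")

n_val = 100_000  # "leave the last 100k docs as validation"
train_docs = owt.select(range(len(owt) - n_val))
val_docs = owt.select(range(len(owt) - n_val, len(owt)))
```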
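The two tokenizers named above are distributed through the HuggingFace transformers library; the snippet below only illustrates how they are typically loaded (the library and its version are assumptions, since the paper does not pin dependencies).

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # bert-base-uncased tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")               # GPT2 tokenizer [45]
```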
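The optimizer and learning-rate schedule in the experiment setup translate into a short configuration. The sketch below uses PyTorch's AdamW with a linear warmup to 3e-4 over 2,500 steps followed by a constant rate; the placeholder model stands in for the 12-layer, 768-dimensional transformer and is not the authors' architecture.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)  # placeholder for the 12-layer, 768-dim, 12-head transformer

lr = 3e-4
warmup_steps = 2_500
batch_size = 512   # per the setup above; dropout of 0.1 is applied inside the blocks (not shown)

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# Warm up linearly from 0 to 3e-4 over 2,500 steps, then hold the rate constant.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
```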