Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Authors: Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, Volodymyr Kuleshov
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate BD3-LMs across standard language modeling benchmarks and demonstrate their ability to generate arbitrary-length sequences unconditionally. We pre-train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps and fine-tune under varying L′ for 150K gradient steps on the One Billion Words dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)). |
| Researcher Affiliation | Collaboration | Correspondence to Marianne Arriola: EMAIL Cornell Tech, NY, USA. Stanford University, CA, USA. Cohere, NY, USA. |
| Pseudocode | Yes | Algorithm 1 Block Diffusion Training Algorithm 2 Block Diffusion Sampling |
| Open Source Code | Yes | We provide the code, along with the model weights and blog post, on the project page: https://m-arriola.com/bd3lms. Code: https://github.com/kuleshov-group/bd3lms |
| Open Datasets | Yes | We conduct experiments on two datasets: The One Billion Word Dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)). |
| Dataset Splits | Yes | Models trained on LM1B use the bert-base-uncased tokenizer and a context length of 128. We report perplexities on the test split of LM1B. Models trained on OWT use the GPT2 tokenizer (Radford et al., 2019) and a context length of 1024. Since OWT does not have a validation split, we leave the last 100k documents for validation. |
| Hardware Specification | Yes | We use 3090, A5000, A6000, and A100 GPUs. |
| Software Dependencies | Yes | FlexAttention (Dong et al., 2024) is a compiler-driven programming model that enables efficient implementation of attention mechanisms with structured sparsity in PyTorch... significantly less memory with up to 5X speedup over the naive native scaled_dot_product_attention implementation in PyTorch (2.5) on an A5000 GPU |
| Experiment Setup | Yes | We use the AdamW optimizer with a batch size of 512 and constant learning rate warmup from 0 to 3e-4 for 2.5K gradient updates. We train a base BD3-LM using the maximum context length L′ = L for 850K gradient steps. Then, we fine-tune under varying L′ using the noise schedule optimization for 150K gradient steps on the One Billion Words dataset (LM1B) and OpenWebText (OWT). |
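
The Dataset Splits and Experiment Setup rows can be read as a small configuration. The sketch below is one possible interpretation, not the authors' code: it assumes the warmup is linear from 0 to 3e-4 over 2,500 updates and constant afterwards, and that the OWT validation set is simply the final 100k documents by index. The function names `learning_rate` and `split_train_val` are hypothetical.

```python
# Values quoted in the table above.
PEAK_LR = 3e-4          # learning rate reached at the end of warmup
WARMUP_STEPS = 2_500    # gradient updates spent warming up
VAL_DOCS = 100_000      # OWT documents held out for validation


def learning_rate(step: int) -> float:
    """Learning rate at a given gradient step.

    Assumption: warmup is linear from 0 to PEAK_LR, then the rate stays constant.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


def split_train_val(num_docs: int, val_docs: int = VAL_DOCS) -> tuple[range, range]:
    """Index ranges for train and validation, holding out the last val_docs documents."""
    if num_docs <= val_docs:
        raise ValueError("corpus must contain more documents than the validation set")
    cut = num_docs - val_docs
    return range(0, cut), range(cut, num_docs)


if __name__ == "__main__":
    for step in (0, 1_250, 2_500, 850_000):
        print(step, learning_rate(step))
    # 1_000_000 is a placeholder corpus size, not the real OWT document count.
    train_idx, val_idx = split_train_val(1_000_000)
    print(len(train_idx), len(val_idx))
```

The corpus size passed to `split_train_val` is a placeholder; the real OWT document count should be taken from the dataset itself.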