Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Authors: Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks. |
| Researcher Affiliation | Academia | Department of Computer Science, Cornell University EMAIL, EMAIL |
| Pseudocode | Yes | In Algorithm 1, we present the sampling procedure for E2D2 applied to the block diffusion parameterization, which generates block-by-block while enabling efficient KV caching. ... The training procedure for E2D2 is presented in Algorithm 2. |
| Open Source Code | Yes | We provide the code, model weights, and blog post on the project page: https://m-arriola.com/e2d2. Code: https://github.com/kuleshov-group/e2d2 |
| Open Datasets | Yes | Datasets: We examine 1) text summarization (CNN/Daily Mail; [27, 57]) for which we compute ROUGE scores [37], 2) machine translation (WMT14 de-en; [6]) for which we compute the BLEU [50] score, and 3) mathematical reasoning (GSM8K; [10]) for which we compute zero-shot pass@1 accuracy. We also train E2D2 on the widely used pretraining OpenWebText dataset [21]. |
| Dataset Splits | Yes | Data: For this task, we use the CNN/Daily Mail dataset version 3.0 [27, 57] downloaded from https://huggingface.co/datasets/abisee/cnn_dailymail. Data was pre-processed to add a prefix to summarizations: Summary: . Inputs and targets were truncated to a maximum length of 512 each, ensuring a maximum sequence length of 1024 for sequences seen during training. ... Since OWT does not have a validation split, we leave the last 100k documents for validation. |
| Hardware Specification | Yes | Decoding throughput (Tput) is measured in tokens / sec on 1 H100 80GB machine. For all models, we use T = L sampling steps, so the throughput can be higher for diffusion when T < L. We report mean ± standard deviation for 100 samples. ... For the ablation results, we used a single A100 (80GB) GPU. |
| Software Dependencies | No | The paper lists several software packages and their licenses in Table 11 of Appendix F, such as "Hugging Face [71] Apache 2.0" and "PyTorch [51] BSD-3 Clause". However, it does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | Hyperparameters: We used the Qwen/Qwen3-0.6B-Base tokenizer. All models had hidden size of 256 and intermediate hidden size of 768. The AR and MDLM baselines consisted of 28 transformer layers, corresponding to 80M parameters. BD3LM had 12 layers and used a block size of S = 8, corresponding to 60M parameters. E2D2 also used S = 8 and consisted of 20 encoder and 8 decoder layers, corresponding to 80M parameters. We used the last hidden state variant of the E2D2 model. We trained with batch size 128. Learning rate was linearly warmed-up for 1000 steps until a maximum of 3e-4. Models were trained for a maximum of 500k steps and we use early stopping on the validation loss to select the best model. For all of our experiments, we used the ADAM optimizer [31] with weight decay 1e-5 and (β1, β2) = (0.9, 0.98). We also apply gradient clipping to 1.0. |
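
The Experiment Setup and Dataset Splits rows can be read as a small configuration. The sketch below is illustrative and is not taken from the authors' repository: the class and function names (`SetupConfig`, `learning_rate`, `split_train_validation`) are hypothetical, the values are those quoted in the table (reading the learning rate and weight decay as 3e-4 and 1e-5), and holding the learning rate constant after warmup is an assumption, since the quoted text only describes the warmup phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    tokenizer: str = "Qwen/Qwen3-0.6B-Base"
    hidden_size: int = 256
    intermediate_size: int = 768
    encoder_layers: int = 20      # E2D2 encoder depth
    decoder_layers: int = 8       # E2D2 decoder depth
    block_size: int = 8           # S = 8
    batch_size: int = 128
    max_lr: float = 3e-4
    warmup_steps: int = 1000
    max_steps: int = 500_000
    weight_decay: float = 1e-5
    betas: tuple = (0.9, 0.98)
    grad_clip: float = 1.0


def learning_rate(step: int, cfg: SetupConfig) -> float:
    """Linear warmup to cfg.max_lr over cfg.warmup_steps.

    Assumption: the rate is held constant after warmup; the quoted
    text does not say what happens once the maximum is reached.
    """
    if step < cfg.warmup_steps:
        return cfg.max_lr * (step + 1) / cfg.warmup_steps
    return cfg.max_lr


def split_train_validation(documents: list, n_validation: int = 100_000):
    """Hold out the last n_validation documents for validation,
    mirroring the split described for OWT in the Dataset Splits row."""
    if n_validation <= 0 or n_validation >= len(documents):
        raise ValueError("n_validation must be in (0, len(documents))")
    return documents[:-n_validation], documents[-n_validation:]


if __name__ == "__main__":
    cfg = SetupConfig()
    for step in (0, 499, 999, 10_000):
        print(step, learning_rate(step, cfg))
    train, val = split_train_validation(list(range(10)), n_validation=3)
    print(len(train), len(val))
```

In a PyTorch training loop, these values would typically be passed to the optimizer and a learning-rate scheduler; the exact classes used are not stated in the quoted text, so they are left out here.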