Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

Authors: Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	For the experiments, we focus on testing an instruction-guided open-ended answer generation ability. Previous studies primarily evaluated diffusion LMs using natural language understanding (NLU) benchmarks that are measured with multiple-choice metrics or string matching metrics [29]. However, this makes it difficult to assess the biggest challenge of diffusion LMs: fluent open-ended text generation. We observe that previous diffusion LMs [29] which showed remarkable scores on NLU tasks (e.g., MMLU [16]) perform poorly on benchmarks to evaluate open-ended answer generation such as Alpaca Eval [44], often generating broken sentences. Therefore, we use benchmarks like Alpaca Eval and the metric of G-Eval [26], which is known as most aligning with human evaluation, showing that our method achieves state-of-the-art performance among diffusion LM baselines, with even around one-third the step size of previous work (Figure 1). For bidirectional generation, we provide a proof of concept through sampling patterns (Figure 7). Since most LM tasks assume a unidirectional scenario, benchmark experiments are deferred to future work.
Researcher Affiliation	Academia	Yeongbin Seo Dongha Lee Jaehyung Kim Jinyoung Yeo Department of Artificial Intelligence Yonsei University EMAIL
Pseudocode	Yes	Algorithm 1: Corruption for Instruction Tuning Data
Open Source Code	Yes	The code is available online (https://github.com/ybseo-ac/Conv).
Open Datasets	Yes	For the experiments, we focus on testing an instruction-guided open-ended answer generation ability. Previous studies primarily evaluated diffusion LMs using natural language understanding (NLU) benchmarks that are measured with multiple-choice metrics or string matching metrics [29]. However, this makes it difficult to assess the biggest challenge of diffusion LMs: fluent open-ended text generation. We observe that previous diffusion LMs [29] which showed remarkable scores on NLU tasks (e.g., MMLU [16]) perform poorly on benchmarks to evaluate open-ended answer generation such as Alpaca Eval [44], often generating broken sentences. Therefore, we use benchmarks like Alpaca Eval and the metric of G-Eval [26], which is known as most aligning with human evaluation, showing that our method achieves state-of-the-art performance among diffusion LM baselines, with even around one-third the step size of previous work (Figure 1). For bidirectional generation, we provide a proof of concept through sampling patterns (Figure 7). Since most LM tasks assume a unidirectional scenario, benchmark experiments are deferred to future work.
Dataset Splits	Yes	All baselines are trained using standard finetuning (SFT) on the Alpaca instruction dataset [44]. Subsequently, we apply R2FT to a subset of models, as proposed in Section 4.2. Then, various decoding strategies are applied: In Table 1, categorical refers to categorical sampling, the default decoding strategy implemented in [36]. LLADA represents the ideal decoding setup including semi-AR with stride 512 as described in the paper of [29]. Top-k + Glob corresponds to top-k decoding with global normalization (k = 20 in our paper) proposed in A.3, and Conv refers to convolution decoding (Section 4.1). Since speed is one of the key advantages of diffusion LLMs, all models follow the default setting proposed in [36]: a decoding window size L = 1024 and total steps S = 128, a highly compressed generation process. We mainly test on the checkpoint (180M) from [36], and to assess consistency in larger models, we also report experiments on the LLaDA-8B-Base checkpoint from Nie et al. [29]. For fine-tuning LLaDA, we use LoRA adapter with the optimal hyperparameters provided in the original paper [20].
Hardware Specification	Yes	For small models we mainly conducted experiments on, we used NVIDIA A5000 (24GB) GPUs. The per-GPU batch size was 4. SFT required approximately 190 × 8 GPU-minutes for 35 epochs, with an average GPU memory allocation of 60%. R2FT required 20 × 8 GPU-minutes for 300 steps, also with 60% memory usage. For large models, we used NVIDIA A6000 (48GB) GPUs. The per-GPU batch size was 1. SFT took approximately 220 × 8 GPU-minutes for 800 steps, with 80% average memory allocation. R2FT required 90 × 8 GPU-minutes for 200 steps, with 100% memory allocation.
Software Dependencies	No	No specific software dependencies with version numbers are explicitly mentioned in the main body or appendices of the paper.
Experiment Setup	Yes	For SFT, we use a global batch size of 512, the learning rate of 3e-5, 2500 warm-up steps, and AdamW optimizer across 8 GPUs. For the small model, the loss value converged around 33 epochs. The large model converged faster and was trained for 3 epochs, following the optimal setting of [29]. We use the same hyperparameters for R2FT as SFT, but it is trained only for 200–300 steps, resulting in a peak learning rate of around 5e-6. The small model (182M) was fully fine-tuned from the pretrained checkpoint of [36], while the large model is trained on LLADA-8B-base with an 80M LoRA adapter for both SFT and R2FT. Throughout our work, decoding window size L is set to 1024, and step size S to 128, following [36].
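
The compute and batch-size figures quoted in the table can be cross-checked with a little arithmetic. The Python sketch below is illustrative only and is not taken from the paper or its repository. It assumes that "N × 8 GPU-minutes" means N wall-clock minutes on 8 GPUs, that the large-model runs also used 8 GPUs, and that the global batch size of 512 is reached through gradient accumulation; the paper excerpts do not state any of these. The names `gpu_hours` and `accumulation_steps` are hypothetical helpers.

```python
# Figures quoted in the Hardware Specification and Experiment Setup rows.
NUM_GPUS = 8          # "across 8 GPUs"
GLOBAL_BATCH = 512    # "global batch size of 512"

# Minutes per GPU for each run (the N in "N x 8 GPU-minutes").
MINUTES_PER_GPU = {
    "small/SFT": 190,
    "small/R2FT": 20,
    "large/SFT": 220,
    "large/R2FT": 90,
}

# Per-GPU batch sizes quoted for each model size.
PER_GPU_BATCH = {"small": 4, "large": 1}


def gpu_hours(minutes_per_gpu, num_gpus=NUM_GPUS):
    """Convert 'minutes on num_gpus GPUs' into total GPU-hours."""
    return minutes_per_gpu * num_gpus / 60


def accumulation_steps(global_batch, per_gpu_batch, num_gpus=NUM_GPUS):
    """Gradient-accumulation steps needed to reach global_batch.

    Assumption: the global batch is built by accumulating gradients;
    the source does not say how the batch of 512 is assembled.
    """
    per_step = per_gpu_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global batch is not a multiple of the per-step batch")
    return global_batch // per_step


if __name__ == "__main__":
    for run, minutes in MINUTES_PER_GPU.items():
        print(f"{run}: {gpu_hours(minutes):.1f} GPU-hours")
    for size, batch in PER_GPU_BATCH.items():
        steps = accumulation_steps(GLOBAL_BATCH, batch)
        print(f"{size}: {steps} accumulation steps")
```

Under these assumptions the small model would need 16 accumulation steps and the large model 64; a different GPU count or batching scheme would change both numbers.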