Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM. |
| Researcher Affiliation | Collaboration | Siyan Zhao (UCLA), Devaansh Gupta (UCLA), Qinqing Zheng (Meta AI), Aditya Grover (UCLA) |
| Pseudocode | Yes | Algorithm 1 diffu-GRPO: Policy Gradient Optimization for Masked dLLMs; Algorithm 2 Supervised Finetuning of LLaDA |
| Open Source Code | Yes | Our code is released at https://dllm-reasoning.github.io/. |
| Open Datasets | Yes | We conduct experiments on six reasoning tasks in three categories: (1) Mathematical reasoning: we use GSM8K [10], a dataset of multi-step grade school math problems, and MATH500 [23]... (2) Planning: this includes two tasks: 4x4 Sudoku puzzles... (3) Coding: comprises of two benchmarks; HumanEval [8], a suite of 164 hand-crafted Python algorithmic programming problems and MBPP [6]... For the coding model, we train on the KodCode-Light-RL-10k dataset. |
| Dataset Splits | Yes | For SFT, we train on s1k [28] for 20 epochs, with a sequence length of 4096. For RL, we train a separate model for each task. More specifically, for GSM8K, MATH500, we train on the training split; for Countdown and Sudoku, we train on synthetic generated datasets. ... For the GSM8K dataset, we conduct RL on the training split of the GSM8K dataset and evaluate on the test split. |
| Hardware Specification | Yes | For diffu-GRPO on gsm8k, math, countdown and sudoku tasks, training was conducted on 8 NVIDIA A100-80G GPUs... For diffu-GRPO on coding task, training was conducted on 4 NVIDIA RTX A5000 |
| Software Dependencies | No | The paper mentions 'TRL library [43]', 'AdamW optimizer [25]', and 'Flash Attention 2 [11]' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For diffu-GRPO on gsm8k, math, countdown and sudoku tasks, training was conducted on 8 NVIDIA A100-80G GPUs, with the following hyperparameters: sequence length of 256 tokens, batch size of 6 per GPU, and gradient accumulation steps of 2. We optimized the model using the AdamW optimizer [25], with parameters β1 = 0.9, β2 = 0.99, weight decay of 0.1, learning rate of 3 × 10⁻⁶, and gradient clipping at 0.2. |
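
The hyperparameters quoted in the Experiment Setup entry can be collected into a small configuration object, which also makes the implied effective batch size explicit. This is a minimal sketch and not the authors' code: the class name `DiffuGRPOConfig` and all field names are hypothetical, the learning rate is read as 3 × 10⁻⁶, the clipping value is assumed to be a gradient-norm threshold, and the batch-size arithmetic assumes plain data parallelism across all 8 GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffuGRPOConfig:
    """Hypothetical container for the values quoted in the Experiment Setup entry."""

    num_gpus: int = 8                     # 8 NVIDIA A100-80G GPUs
    per_gpu_batch_size: int = 6           # batch size of 6 per GPU
    gradient_accumulation_steps: int = 2  # gradient accumulation steps of 2
    sequence_length: int = 256            # sequence length of 256 tokens
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    weight_decay: float = 0.1
    learning_rate: float = 3e-6           # assumption: reading of the quoted value
    max_grad_norm: float = 0.2            # assumption: clipping is applied to the gradient norm

    def effective_batch_size(self) -> int:
        # Assumption: every GPU processes its own batch (data parallelism),
        # so samples per optimizer step = GPUs * per-GPU batch * accumulation steps.
        return (
            self.num_gpus
            * self.per_gpu_batch_size
            * self.gradient_accumulation_steps
        )


config = DiffuGRPOConfig()
print(config.effective_batch_size())  # 8 * 6 * 2 = 96
```

Under these assumptions each optimizer step sees 96 samples. The report does not state whether that count refers to prompts or to sampled completions, so the figure should be checked against the released code before reuse.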