Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking

Authors: Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas S Nolte, Brian Karrer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the performance of the EB-Sampler on standard code and math reasoning generation tasks and on logic puzzles solving. The empirical findings in this section support our theoretical derivations and demonstrate the proposed sampler's capabilities.
Researcher Affiliation	Industry	Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, Brian Karrer; FAIR, Meta AI
Pseudocode	Yes	Figure 4: Python code implementation of a single sampling step for common Top-k approaches and for EB-Sampler.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We evaluate open source models on public benchmarks. Minimal code implementation of the core method is shown in Figure 4 and experimental details are listed in Appendix.
Open Datasets	Yes	We use 4 widely used benchmarks. HumanEval (0 shot) (Chen et al., 2021) and MBPP (4 shot) (Austin et al., 2021b) code generation benchmarks; and GSM8K (8 shot) (Cobbe et al., 2021) and Math (4 shot) (Hendrycks et al., 2021) math reasoning benchmarks.
Dataset Splits	Yes	We generate 48K mazes for training and 2K for validation. All mazes are defined on a grid of size 10x10. ... We adopt the standard 9 × 9 Sudoku setting and adapt the code from Alp (2024) to generate 48K training puzzles and 2K held-out puzzles, all with unique solutions.
Hardware Specification	Yes	All benchmarks were run in the same computational setting, on 8 H100.
Software Dependencies	No	The paper includes Python code in Figure 4 and mentions PyTorch, but no specific version numbers for any software dependencies are provided.
Experiment Setup	Yes	All benchmarks were run with the same γ range of values for EB-Sampler, γ ∈ {0, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 1.5}, and with same k range of values for Top-k sampling, k ∈ {1, 2, 4, 8}. All benchmarks were run in the same computational setting, on 8 H100. We report runtimes for confidence and entropy error proxies in Table 3. Runtimes with margin error proxy are longer due to the need to sort over the vocabulary size to compute the top-2 tokens. Runtimes for LLaDA 8B are longer than Dream 7B due to having twice the maximal sequence length of Dream.