Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Remasking Discrete Diffusion Models with Inference-Time Scaling
Authors: Guanghan Wang, Yair Schiff, Subham Sahoo, Volodymyr Kuleshov
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across the domains of natural language, discretized images, and molecule string representations, we demonstrate empirically that ReMDM endows masked diffusion with inference-time scaling that improves sample quality with more computation and enhances controlled generation and downstream performance. |
| Researcher Affiliation | Academia | Guanghan Wang Yair Schiff Subham Sekhar Sahoo Volodymyr Kuleshov Cornell Tech, Cornell University EMAIL |
| Pseudocode | Yes | The high-level pseudocode for the ReMDM sampler is provided in Algorithm 1, with more detailed algorithms for implementing the schedules below deferred to Appendix C. |
| Open Source Code | Yes | We provide the code along with a blog post on the project page: https://remdm.github.io |
| Open Datasets | Yes | We test the text generation capability of ReMDM with unconditional generation from models trained on OpenWebText (OWT; Gokaslan & Cohen [14]). The OWT dataset was tokenized using the gpt-2 tokenizer [38] and sequences were wrapped to a max length of L = 1024. Experimental Setup: We test ReMDM's class-conditioned image generation ability. Concretely, we use a pretrained MaskGIT [7] that was trained on ImageNet [8] samples with 256 × 256 pixels. Experimental Setup: We follow the setup from Schiff et al. [46] to explore controlled small molecule generation. Specifically, we use the QM9 dataset [40, 39] of 133k molecules and their character-based SMILES string representations [59]. To test the effect of remasking sampling on dLLMs, we take LLaDA 8B Instruct and apply DFM and ReMDM samplers (see Appendix D.4 for sampler details). Specifically, we use Countdown (bidirectional reasoning) [63] and TruthfulQA (factual knowledge grasp) [27] as benchmarking tasks. |
| Dataset Splits | Yes | We use the same train-validation split as in Sahoo et al. [41] (where the last 100k documents of OWT were designated as the validation set) and randomly select 5,000 samples from the validation set to serve as the reference for MAUVE score computation. |
| Hardware Specification | No | We gratefully acknowledge their generous GPU infrastructure grants that helped make this work possible. |
| Software Dependencies | No | In Table 10, we list the software packages (and corresponding licenses) used in this work. Table 10: Software (and corresponding license) used in this work. Library: License. Hugging Face [60]: Apache 2.0; Jax [4]: Apache 2.0; NumPy [15]: NumPy license; PyTorch [34]: BSD-3 Clause; PyTorch Lightning [11]: Apache 2.0; RDKit [23]: BSD 3-Clause "New" or "Revised"; Seaborn [58]: BSD 3-Clause "New" or "Revised"; MAUVE [36, 37]: GPLv3; Language Model Evaluation Harness [12]: MIT. |
| Experiment Setup | Yes | In this experiment, we reuse the pretrained AR, SEDD, and MDLM checkpoints released by [41] where the diffusion models are trained using a log-linear schedule, i.e., α_t = 1 − t. AR, SEDD, and MDLM share the same architecture: a Transformer-based model [56] that augments the diffusion transformer [35] with rotary embeddings [55] and consists of 169M parameters. The neural network is comprised of 12 layers, 12 attention heads, and 768 hidden dimensions. Please see Sahoo et al. [41] for the full model architecture and training details. |
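
The log-linear schedule quoted in the Experiment Setup row, α_t = 1 − t, can be illustrated with a minimal sketch. It assumes the standard masked-diffusion reading in which each token independently stays unmasked with probability α_t; that reading is an assumption here, not something this page states. The names `MASK`, `alpha`, and `mask_tokens` are hypothetical and are not taken from the authors' code.

```python
import random

# Hypothetical mask token id, chosen for illustration only.
MASK = -1


def alpha(t: float) -> float:
    """Schedule quoted in the table: alpha_t = 1 - t for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 - t


def mask_tokens(tokens, t, rng):
    """Keep each token independently with probability alpha_t, else mask it.

    Assumption: this is the usual masked-diffusion forward process,
    not a reproduction of the paper's implementation.
    """
    keep_prob = alpha(t)
    return [tok if rng.random() < keep_prob else MASK for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(1024))  # stand-in for a length-1024 token sequence
    for t in (0.0, 0.5, 1.0):
        noised = mask_tokens(tokens, t, rng)
        frac = sum(tok == MASK for tok in noised) / len(noised)
        print(f"t={t:.1f}  alpha_t={alpha(t):.1f}  masked fraction={frac:.3f}")
```

Under this assumption the expected masked fraction at time t is simply t: nothing is masked at t = 0 and every token is masked at t = 1.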