Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Remasking Discrete Diffusion Models with Inference-Time Scaling

Authors: Guanghan Wang, Yair Schiff, Subham Sahoo, Volodymyr Kuleshov

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across the domains of natural language, discretized images, and molecule string representations, we demonstrate empirically that ReMDM endows masked diffusion with inference-time scaling that improves sample quality with more computation and enhances controlled generation and downstream performance.
Researcher Affiliation | Academia | Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov; Cornell Tech, Cornell University; EMAIL
Pseudocode | Yes | The high-level pseudocode for the ReMDM sampler is provided in Algorithm 1, with more detailed algorithms for implementing the schedules below deferred to Appendix C.
Open Source Code | Yes | We provide the code along with a blog post on the project page: https://remdm.github.io
Open Datasets | Yes | We test the text generation capability of ReMDM with unconditional generation from models trained on OpenWebText (OWT; Gokaslan & Cohen [14]). The OWT dataset was tokenized using the gpt-2 tokenizer [38] and sequences were wrapped to a max length of L = 1024. Experimental Setup: We test ReMDM's class-conditioned image generation ability. Concretely, we use a pretrained MaskGiT [7] that was trained on ImageNet [8] samples with 256 × 256 pixels. Experimental Setup: We follow the setup from Schiff et al. [46] to explore controlled small molecule generation. Specifically, we use the QM9 dataset [40, 39] of 133k molecules and their character-based SMILES string representations [59]. To test the effect of remasking sampling on dLLMs, we take LLaDA 8B Instruct and apply DFM and ReMDM samplers (see Appendix D.4 for sampler details). Specifically, we use Countdown (bidirectional reasoning) [63] and TruthfulQA (factual knowledge grasp) [27] as benchmarking tasks.
Dataset Splits | Yes | We use the same train-validation split as in Sahoo et al. [41] (where the last 100k documents of OWT were designated as the validation set) and randomly select 5,000 samples from the validation set to serve as the reference for MAUVE score computation.
Hardware Specification | No | We gratefully acknowledge their generous GPU infrastructure grants that helped make this work possible.
Software Dependencies | No | In Table 10, we list the software packages (and corresponding licenses) used in this work. Table 10: Software (and corresponding license) used in this work. Library (License): Hugging Face [60] (Apache 2.0); Jax [4] (Apache 2.0); NumPy [15] (NumPy license); PyTorch [34] (BSD-3 Clause); PyTorch Lightning [11] (Apache 2.0); RDKit [23] (BSD 3-Clause "New" or "Revised"); Seaborn [58] (BSD 3-Clause "New" or "Revised"); MAUVE [36, 37] (GPLv3); Language Model Evaluation Harness [12] (MIT).
Experiment Setup | Yes | In this experiment, we reuse the pretrained AR, SEDD, and MDLM checkpoints released by [41] where the diffusion models are trained using a log-linear schedule, i.e., α_t = 1 − t. AR, SEDD, and MDLM share the same architecture: a Transformer-based model [56] that augments the diffusion transformer [35] with rotary embeddings [55] and consists of 169M parameters. The neural network is comprised of 12 layers, 12 attention heads, and 768 hidden dimensions. Please see Sahoo et al. [41] for the full model architecture and training details.
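
As a rough illustration of two details quoted in the table, the sketch below encodes the schedule α_t = 1 − t from the Experiment Setup row and the validation split from the Dataset Splits row (last 100k documents held out, 5,000 of them drawn at random as the MAUVE reference). It is not the authors' code. The function names, the seed, and the corpus size in the usage note are hypothetical, and the table does not say how the random selection was seeded or implemented.

```python
import random


def alpha(t: float) -> float:
    """Schedule quoted in the table: alpha_t = 1 - t, for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 - t


def split_indices(num_docs: int, num_val: int = 100_000):
    """Hold out the last `num_val` documents as validation; the rest is train."""
    if num_docs < num_val:
        raise ValueError("corpus is smaller than the validation set")
    cut = num_docs - num_val
    return range(0, cut), range(cut, num_docs)


def sample_reference(val_indices, k: int = 5_000, seed: int = 0):
    """Draw k distinct validation indices (the seed is an assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(val_indices, k))
```

For example, with a placeholder corpus of 250,000 documents (not the real OWT size), `split_indices(250_000)` yields 150,000 training indices and 100,000 validation indices, and `sample_reference` then picks 5,000 of the latter without replacement.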