Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Constrained Discrete Diffusion

Authors: Michael Cardei, Jacob K Christopher, Bhavya Kailkhura, Tom Hartvigsen, Ferdinando Fioretto

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments in toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion demonstrate that CDD achieves zero constraint violations across a diverse array of tasks, preserving fluency, novelty, and coherence while outperforming autoregressive and existing discrete diffusion approaches.
Researcher Affiliation	Collaboration	Michael Cardei (University of Virginia); Jacob K. Christopher (University of Virginia); Thomas Hartvigsen (University of Virginia); Bhavya Kailkhura (Lawrence Livermore National Laboratory); Ferdinando Fioretto (University of Virginia)
Pseudocode	Yes	Algorithm 1: Augmented Lagrangian. Init: λ, µ, η, α, δ; Input: probability distribution x_t; y ← x_t; while g(y) < δ do { for j ← 1 to max_inner_iter do { L_KL ← D_KL(x_t ‖ y); L_viol ← Σ_{i=1}^{m} [λ_i g_i(ϕ(y)) + (µ_i / 2) g_i(ϕ(y))^2]; L_ALM ← L_KL + L_viol; y ← y − η ∇_y L_ALM }; λ ← λ + µ g(y); µ ← min(αµ, µ_max) }; x_s ← y; Output: x_s
Open Source Code	No	Code will be released following the review process.
Open Datasets	Yes	Specifically, CDD is evaluated on three representative use cases illustrated in Figure 1: (i) toxicity mitigation with tunable thresholds for safe LLM deployment, (ii) property adherence for molecular sequence generation, and (iii) instruction-following tasks requiring precise symbolic constraints. We select prefixes with toxicity score above 0.5, testing how these models perform on toxic inputs, and perplexity below 30.0, avoiding incoherent generations. To model the constraints, GPT-Neo [43] is finetuned on the Jigsaw Toxic Comment Classification Challenge. For our molecule generation experiments, we utilize the QM9 dataset [47], as used by Schiff et al. [8]. This dataset comprises approximately 133,885 small organic molecules, each encoded as a SMILES string [44]. Finally, the last application focuses on satisfying user-specified symbolic constraints, a task that many autoregressive models struggle with. We evaluate CDD against such models using two types of constraints. First, we assess the ability of the tested models to accurately answer how many letters are in a word, such as "How many R's are in Strawberry?". We additionally test lexical constraints based on the COLLIE dataset [49], prompting the models to generate word-indexed constrained sentences, such as "Generate a sentence where the third-to-last word is mankind".
Dataset Splits	Yes	We evaluate on 1,000 samples, and discuss ALM hyper-parameters in Appendix C, runtime in Appendix D, and diversity as measured by entropy [10] in Appendix E.
Hardware Specification	Yes	For all test settings, we evaluate with the same subset of 200 random samples, and all are run on a single NVIDIA RTX A6000 GPU.
Software Dependencies	No	The paper does not explicitly state specific version numbers for software dependencies.
Experiment Setup	Yes	For the ALM projection, we initialize using the following hyperparameters: λ_init = 0.0, µ_init = 1.0, outer_iter_max = 1000, inner_iter_max = 10, η = 0.20. We choose these settings for the toxicity experiment as they result in the lowest perplexity in the Appendix C ablation study. For natural language toxicity reduction, we use a generation length of 100 tokens.
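
Illustrative sketch (not the authors' code, which the table above lists as unreleased): the Pseudocode and Experiment Setup rows describe an augmented Lagrangian loop that alternates gradient steps on L_KL + L_viol with multiplier updates. The toy Python below applies that loop to a single token distribution. Everything beyond the reported values of λ_init, µ_init, η, outer_iter_max and inner_iter_max is an assumption made for illustration: the constraint g(y) = E_y[score] − τ on a hypothetical per-token score, the softmax parameterization of y, and the values of α, µ_max and δ. The paper's relaxation ϕ and its actual constraints are not reproduced.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def expected_score(y, scores):
    """Expected per-token score under distribution y."""
    return sum(p * s for p, s in zip(y, scores))


def alm_project(x_t, scores, tau,
                lam=0.0, mu=1.0, eta=0.20,              # values reported in the table
                outer_iter_max=1000, inner_iter_max=10,  # values reported in the table
                alpha=1.5, mu_max=10.0, delta=1e-3):     # assumed, not from the source
    """Move x_t toward the set {y : E_y[score] <= tau} while staying close in KL.

    Assumes x_t sums to 1. y is parameterized as softmax(z) so it remains a
    valid distribution; this parameterization is an assumption of the sketch.
    """
    z = [math.log(max(p, 1e-12)) for p in x_t]
    y = softmax(z)
    g = expected_score(y, scores) - tau  # signed constraint value
    outer = 0
    while g > delta and outer < outer_iter_max:
        for _ in range(inner_iter_max):
            y = softmax(z)
            e = expected_score(y, scores)
            g = e - tau
            # d/dz of D_KL(x_t || y) is (y - x_t); d/dz of g is y * (scores - e).
            # L_viol = lam * g + (mu / 2) * g^2 contributes (lam + mu * g) * dg/dz.
            weight = lam + mu * g
            grad = [(y_j - x_j) + weight * y_j * (s_j - e)
                    for y_j, x_j, s_j in zip(y, x_t, scores)]
            z = [z_j - eta * d for z_j, d in zip(z, grad)]
        y = softmax(z)
        g = expected_score(y, scores) - tau
        lam = lam + mu * g             # multiplier update
        mu = min(alpha * mu, mu_max)   # penalty growth, capped at mu_max
        outer += 1
    return y


# Toy usage: token 0 carries a high hypothetical "toxicity" score.
x_t = [0.7, 0.2, 0.1]
scores = [0.9, 0.1, 0.0]
x_s = alm_project(x_t, scores, tau=0.3)
print(x_s, expected_score(x_s, scores))
```

In this toy setting the loop shifts probability mass away from the high-score token until the expected score is within δ of the threshold τ. An input that already satisfies the constraint skips the loop and is returned essentially unchanged.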