Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

KLASS: KL-Guided Fast Inference in Masked Diffusion Models

Authors: Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate our method on challenging reasoning benchmarks, including GSM8K, MATH, HumanEval, and MBPP. We show that applying KLASS with large-scale masked diffusion models not only halves the number of sampling steps compared to standard greedy or Top-k decoding [19], but also achieves higher accuracy, achieving state-of-the-art results compared to other diffusion samplers. Figure 1b, correct samples consistently exhibit significantly lower KL scores than incorrect ones, for all models and datasets.
Researcher Affiliation	Academia	Seo Hyun Kim¹, Sunwoo Hong¹, Hojung Jung¹, Youngrok Park¹, Se-Young Yun¹ (¹KAIST AI)
Pseudocode	Yes	We provide a pseudocode of our algorithm with further analysis in Appendix B. (Appendix B title: Pseudo Code, containing Algorithm 1: KL-Adaptive Stability Sampling (KLASS))
Open Source Code	Yes	Our code is available at https://github.com/shkim0116/KLASS.
Open Datasets	Yes	We empirically validate our method on challenging reasoning benchmarks, including GSM8K [10], MATH [16], HumanEval [9], and MBPP [2]. We evaluate KLASS on Masked Diffusion Language Model (MDLM) [34] pre-trained on the OpenWebText corpus [13]. We evaluate KLASS on the MMaDA (Multimodal Large Diffusion Language Models) [49], a multimodal diffusion foundation model... between our 10,000 generated samples and the ImageNet validation set. We use QM9 [31], which contains molecules with up to nine heavy atoms, represented in SMILES [46].
Dataset Splits	Yes	We evaluate on four reasoning benchmarks: GSM8K [10] and MATH500 [16] for math, and HumanEval [9] and MBPP-sanitized [2] for code synthesis... a small validation set of around 100 examples. For each sampler, we generate 10,000 class-conditional images with uniformly sampled ImageNet labels... FID is computed between our 10,000 generated samples and the ImageNet validation set.
Hardware Specification	Yes	All sampling experiments are conducted on a single NVIDIA RTX A5000 GPU. All runs were executed on a single NVIDIA RTX A6000 GPU. All image-generation runs use a single NVIDIA RTX A5000. We utilize a single RTX 3090 GPU for both training and the inference.
Software Dependencies	No	The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA, etc.) used for the experiments.
Experiment Setup	Yes	For both models we set the generation length to 256 tokens, with LLaDA using a block size of 64. The generation temperature is set to 0 for LLaDA and 0.2 for Dream. In KLASS, we compute per-token KL divergence over a history length of n = 2, and apply KL thresholds ranging from 0.001 to 0.01 and confidence thresholds from 0.5 to 0.9. Full configuration details and a lightweight guideline for hyperparameter selection are provided in Appendix D.1.2. (Table 7 provides specific thresholds). For all diffusion-based methods, we generate 1,000 sequences of length 1,024 tokens under a fixed 512-step schedule, applying nucleus (top-p) filtering at p = 0.9, a history length n = 2, a KL divergence threshold ϵ_KL = 1e-4, and a confidence threshold τ = 0.57. For KLASS, we fix the hyperparameters to history length n = 1, KL divergence threshold ϵ_KL = 0.3, and confidence threshold τ = 0.1. We use diffusion step size 32, taking 25,000 gradient steps. We train the model with classifier-free guidance (CFG) training with dropout condition probability of 0.1. We generate samples with CFG strength γ = 1.
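
The Experiment Setup row mentions three KLASS hyperparameters: a history length n, a KL threshold, and a confidence threshold. The sketch below is one plausible reading of how such a per-token stability check could combine them. It is not the authors' implementation (see Algorithm 1 in the paper's Appendix B and the linked repository for that). The function names `kl_divergence` and `is_stable`, the direction of the KL term, and the interpretation of n as the number of consecutive step-to-step KL values checked are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def is_stable(history, kl_threshold=0.01, conf_threshold=0.9, n=2):
    """Hypothetical stability check for one token position.

    history: predictive distributions for this position, oldest first.
    Assumption: the token counts as stable when the last n step-to-step
    KL values are all below kl_threshold and the latest distribution's
    maximum probability is at least conf_threshold.
    """
    if len(history) < n + 1:
        return False  # not enough steps to form n consecutive KL values
    recent = history[-(n + 1):]
    kls = [kl_divergence(cur, prev) for prev, cur in zip(recent, recent[1:])]
    confident = max(recent[-1]) >= conf_threshold
    return confident and all(k < kl_threshold for k in kls)


if __name__ == "__main__":
    settled = [[0.5, 0.5], [0.05, 0.95], [0.05, 0.95], [0.05, 0.95]]
    drifting = [[0.5, 0.5], [0.3, 0.7], [0.05, 0.95]]
    print(is_stable(settled))
    print(is_stable(drifting))
```

The default thresholds are taken from the upper ends of the ranges quoted in the row above (KL threshold 0.01, confidence threshold 0.9); the paper's Table 7 lists the values actually used per model and benchmark.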