Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

Authors: Pengxiang Li, Shilin Yan, Jiayin Cai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we empirically evaluate the effectiveness of Adaptive Classifier-Free Guidance (A-CFG). We first describe our experimental setup, including datasets, baseline models, evaluation metrics, and key implementation details. We then present quantitative results from Table 1, comparing LLaDA with A-CFG against LLaDA with standard CFG, LLaDA without guidance, and other state-of-the-art models. Subsequently, we conduct ablation studies to analyze the impact of A-CFG's core hyperparameter. Finally, we provide qualitative examples to illustrate the behavior and benefits of our proposed method.
Researcher Affiliation	Collaboration	1PolyU 2Alibaba 3THU 4CUHK 5PKU 6ICL
Pseudocode	Yes	Algorithm 1 Adaptive Classifier-Free Guidance (A-CFG) for one generation step k
Open Source Code	Yes	Code is available at https://github.com/pixeli99/A-CFG.
Open Datasets	Yes	We evaluate A-CFG on a diverse suite of standard benchmarks covering general language understanding, mathematical and scientific reasoning, and planning tasks. General Language Understanding: MMLU (Massive Multitask Language Understanding) [12], BBH (Big-Bench Hard) [34], ARC-C (AI2 Reasoning Challenge Challenge Set) [7], HellaSwag [44], TruthfulQA [21], WinoGrande [31], and PIQA (Physical Interaction QA) [4]. Mathematics & Science Reasoning: GSM8K (Grade School Math 8K) [8], MATH [13], and GPQA (Graduate-Level Google-Proof Q&A) [28]. Planning Tasks: Countdown [42] and Sudoku [42].
Dataset Splits	Yes	Evaluation mode. Closed-form tasks supply a prompt with a finite set of candidate answers; we compute each candidate's conditional log-likelihood and select the most likely. Open-ended tasks require free-form generation; we sample responses and score them with task-specific metrics such as exact-match accuracy. Likelihood estimation. For likelihood-based evaluations we approximate the conditional perplexity bound with Monte-Carlo sampling. A single sample suffices when only one target token is queried (e.g. MMLU). We adopt the same setting as LLaDA; for all other multiple-token tasks we draw 128 samples, which we found to stabilise variance without adding prohibitive cost.
Hardware Specification	Yes	All experiments were conducted using NVIDIA H800 GPUs.
Software Dependencies	No	The paper mentions using Python, but no specific versions for Python or any other libraries like PyTorch, TensorFlow, or CUDA are provided.
Experiment Setup	Yes	For LLaDA's iterative generation, we use 256 sampling steps with low-confidence remasking. For Standard CFG, the guidance scale w was selected from {0.5, 1.0, 1.5, 2.0} based on performance on the validation set of each respective task. For our A-CFG, the guidance scale w was similarly tuned. Once a value of w is chosen for a given model, the same w is kept fixed across all downstream benchmarks for that model. The adaptive re-masking proportion ρ (determining the fraction of previously generated tokens to re-mask based on low confidence, as defined in Section 3.2.1) was set to 0.7. The confidence for token selection in A-CFG is based on the softmax probability of the predicted token at each masked position. Generation hyper-parameters. Unless otherwise stated, we set the answer length to 256 tokens and run the reverse diffusion process for 256 steps (one token revealed per step).
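
Illustrative sketch (not from the paper or its repository): the Experiment Setup row describes re-masking a fraction ρ of previously generated tokens by low confidence, where confidence is the softmax probability of the predicted token, and combining predictions with a guidance scale w. The Python below is one possible reading of that description. The function names, the `MASK` placeholder, the round-down rule for ρ, the use of the re-masked sequence as the guidance input, and the particular CFG formula are all assumptions made for illustration; Algorithm 1 in the paper and the linked code are the authoritative references.

```python
import math

MASK = None  # hypothetical placeholder standing in for the model's mask token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(logits_per_position):
    """Confidence per position: softmax probability of the predicted (argmax) token."""
    return [max(softmax(row)) for row in logits_per_position]


def low_confidence_positions(confidences, generated_positions, rho):
    """Pick the fraction rho of generated positions with the lowest confidence."""
    n_remask = int(rho * len(generated_positions))  # assumption: round down
    ranked = sorted(generated_positions, key=lambda i: confidences[i])
    return sorted(ranked[:n_remask])


def remask(tokens, positions_to_mask):
    """Replace the selected positions with MASK (assumed to form the guidance input)."""
    masked = set(positions_to_mask)
    return [MASK if i in masked else tok for i, tok in enumerate(tokens)]


def guided_logits(cond, uncond, w):
    """Assumed CFG combination: uncond + (1 + w) * (cond - uncond)."""
    return [u + (1.0 + w) * (c - u) for c, u in zip(cond, uncond)]
```

With this assumed formula, w = 0 returns the conditional logits unchanged, and larger w pushes the prediction further away from what the re-masked input alone would produce. With ρ = 0.7, as reported above, roughly 70% of the already generated tokens would be re-masked under the round-down rule assumed here.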