Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Authors: Pengxiang Li, Shilin Yan, Jiayin Cai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate the effectiveness of Adaptive Classifier-Free Guidance (A-CFG). We first describe our experimental setup, including datasets, baseline models, evaluation metrics, and key implementation details. We then present quantitative results from Table 1, comparing LLaDA with A-CFG against LLaDA with standard CFG, LLaDA without guidance, and other state-of-the-art models. Subsequently, we conduct ablation studies to analyze the impact of A-CFG's core hyperparameter. Finally, we provide qualitative examples to illustrate the behavior and benefits of our proposed method. |
| Researcher Affiliation | Collaboration | ¹PolyU ²Alibaba ³THU ⁴CUHK ⁵PKU ⁶ICL |
| Pseudocode | Yes | Algorithm 1 Adaptive Classifier-Free Guidance (A-CFG) for one generation step k |
| Open Source Code | Yes | Code is available at https://github.com/pixeli99/A-CFG. |
| Open Datasets | Yes | We evaluate A-CFG on a diverse suite of standard benchmarks covering general language understanding, mathematical and scientific reasoning, and planning tasks. General Language Understanding: MMLU (Massive Multitask Language Understanding) [12], BBH (Big-Bench Hard) [34], ARC-C (AI2 Reasoning Challenge Challenge Set) [7], HellaSwag [44], TruthfulQA [21], WinoGrande [31], and PIQA (Physical Interaction QA) [4]. Mathematics & Science Reasoning: GSM8K (Grade School Math 8K) [8], MATH [13], and GPQA (Graduate-Level Google-Proof Q&A) [28]. Planning Tasks: Countdown [42] and Sudoku [42]. |
| Dataset Splits | Yes | Evaluation mode. Closed-form tasks supply a prompt with a finite set of candidate answers; we compute each candidate's conditional log-likelihood and select the most likely. Open-ended tasks require free-form generation; we sample responses and score them with task-specific metrics such as exact-match accuracy. Likelihood estimation. For likelihood-based evaluations we approximate the conditional perplexity bound with Monte-Carlo sampling. A single sample suffices when only one target token is queried (e.g. MMLU). We adopt the same setting as LLaDA; for all other multiple-token tasks we draw 128 samples, which we found to stabilise variance without adding prohibitive cost. |
| Hardware Specification | Yes | All experiments were conducted using NVIDIA H800 GPUs. |
| Software Dependencies | No | The paper mentions using Python, but no specific versions for Python or any other libraries like PyTorch, TensorFlow, or CUDA are provided. |
| Experiment Setup | Yes | For LLaDA's iterative generation, we use 256 sampling steps with low-confidence remasking. For Standard CFG, the guidance scale w was selected from {0.5, 1.0, 1.5, 2.0} based on performance on the validation set of each respective task. For our A-CFG, the guidance scale w was similarly tuned. Once a value of w is chosen for a given model, the same w is kept fixed across all downstream benchmarks for that model. The adaptive re-masking proportion ρ (determining the fraction of previously generated tokens to re-mask based on low confidence, as defined in Section 3.2.1) was set to 0.7. The confidence for token selection in A-CFG is based on the softmax probability of the predicted token at each masked position. Generation hyper-parameters. Unless otherwise stated, we set the answer length to 256 tokens and run the reverse diffusion process for 256 steps (one token revealed per step). |
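
The Experiment Setup row names three ingredients: a per-token confidence taken from the softmax probability of the predicted token, a re-masking proportion ρ = 0.7 applied to the lowest-confidence generated tokens, and a guidance scale w. The NumPy sketch below illustrates how these pieces could fit together. It is not the authors' implementation (see the linked repository for that). The function names, the floor-based rounding of ρ, and the specific guidance formula (the commonly used `uncond + (1 + w) * (cond - uncond)` extrapolation) are assumptions made for illustration.

```python
import numpy as np


def token_confidence(logits):
    """Softmax probability of the predicted (argmax) token at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)


def low_confidence_positions(confidence, generated, rho):
    """Indices of the rho fraction of generated positions with lowest confidence.

    confidence: float array of shape (seq_len,).
    generated:  bool array marking positions that already hold generated tokens.
    rho:        re-masking proportion (0.7 in the reported setup).
    Rounding down with floor is an assumption of this sketch.
    """
    idx = np.flatnonzero(generated)
    k = int(np.floor(rho * idx.size))
    order = np.argsort(confidence[idx], kind="stable")  # ascending confidence
    return np.sort(idx[order[:k]])


def guided_logits(cond_logits, uncond_logits, w):
    """Assumed CFG combination: extrapolate from unconditional toward conditional."""
    return uncond_logits + (1.0 + w) * (cond_logits - uncond_logits)
```

In an A-CFG-style step, the positions returned by `low_confidence_positions` would be replaced by the mask token to build the "unconditional" input, and the two resulting logit tensors would be combined with `guided_logits`. With `w = 0` the combination reduces to the conditional logits.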