Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Steering When Necessary: Flexible Steering Large Language Models with Backtracking
Authors: Zifeng Cheng, Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. |
| Researcher Affiliation | Academia | Zifeng Cheng, Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu; State Key Laboratory for Novel Software Technology, Nanjing University, China; EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, guq@.nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 The overall flow of FASB |
| Open Source Code | No | "Our code will be released at https://github.com/gjw185/FASB." and "We will release the data and code after the paper is accepted." |
| Open Datasets | Yes | The TruthfulQA (Lin et al., 2022) dataset includes an open-ended generation task and a multiple-choice task. For the multiple-choice tasks, we use the following datasets: COPA (Gordon et al., 2012), Story Cloze (Mostafazadeh et al., 2016), NLI (Bowman et al., 2015), MMLU (Hendrycks et al., 2021), SST2 (Socher et al., 2013), and Winogrande (Sakaguchi et al., 2020). |
| Dataset Splits | No | The paper mentions datasets such as TruthfulQA, COPA, Story Cloze, NLI, MMLU, SST2, Winogrande, Natural Questions, TriviaQA, RealToxicityPrompts, and WikiHop, and includes an ablation study that varies the 'Training Set Size' as a percentage of the original dataset. However, the experimental settings do not explicitly specify training, validation, and test splits for these datasets (e.g., exact percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions specific LLM models used (e.g., LLaMA2-7B-CHAT, Qwen2.5-7B) but does not provide specific version numbers for software dependencies or libraries like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In the Probe method, for the TruthfulQA dataset, we intervene using the top-24 heads, set the threshold range to [0.4, 0.5], the number of backtracking steps to 10, and search for the intervention strength in the range of [40, 80] with a step size of 10. For the six multiple-choice tasks, our threshold search range is [0.3, 0.4, 0.5, 0.6], the intervention strength search range is [0, 250] with a step size of 10, and the number of backtracking steps is 10. |
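
The search ranges quoted in the Experiment Setup row can be enumerated as a small hyperparameter grid. The following Python sketch is illustrative only and is not taken from the paper or its repository: the dictionary keys and the helper names `inclusive_range` and `expand` are hypothetical, the range endpoints are assumed to be inclusive, and the TruthfulQA threshold "range" [0.4, 0.5] is assumed to mean two candidate values rather than a continuous interval.

```python
from itertools import product


def inclusive_range(start, stop, step):
    """Return the integers from start to stop (inclusive) in increments of step."""
    return list(range(start, stop + 1, step))


# Values quoted in the Experiment Setup row; key names are hypothetical.
truthfulqa_grid = {
    "top_k_heads": [24],
    "threshold": [0.4, 0.5],  # assumption: two candidate values
    "backtrack_steps": [10],
    "strength": inclusive_range(40, 80, 10),
}

# The row does not state a head count for the multiple-choice tasks,
# so none is included here.
multiple_choice_grid = {
    "threshold": [0.3, 0.4, 0.5, 0.6],
    "backtrack_steps": [10],
    "strength": inclusive_range(0, 250, 10),
}


def expand(grid):
    """Expand a dict of candidate lists into a list of configuration dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


if __name__ == "__main__":
    print(len(expand(truthfulqa_grid)), "TruthfulQA configurations")
    print(len(expand(multiple_choice_grid)), "multiple-choice configurations")
```

Under these assumptions, the grid contains 10 configurations for TruthfulQA (2 thresholds × 5 strengths) and 104 per multiple-choice task (4 thresholds × 26 strengths).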