Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

Authors: Zifeng Cheng, Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on the Truthful QA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines.
Researcher Affiliation	Academia	Zifeng Cheng, Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu State Key Laboratory for Novel Software Technology, Nanjing University, China EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, guq@.nju.edu.cn
Pseudocode	Yes	Algorithm 1 The overall flow of FASB
Open Source Code	No	Our code will be released at https://github.com/gjw185/FASB. We will release the data and code after the paper is accepted.
Open Datasets	Yes	Truthful QA (Lin et al., 2022) dataset includes open-ended generation task and multiple-choice task. For the multiple-choice tasks, we use datasets: COPA (Gordon et al., 2012), Story Cloze (Mostafazadeh et al., 2016), NLI (Bowman et al., 2015), MMLU (Hendrycks et al., 2021), SST2 (Socher et al., 2013), and Winogrande (Sakaguchi et al., 2020).
Dataset Splits	No	The paper mentions datasets like Truthful QA, COPA, Story Cloze, NLI, MMLU, SST2, Winogrande, Natural Questions, Trivia QA, Real Toxicity Prompts, and Wiki Hop, and an ablation study varying the 'Training Set Size' using percentages of the original dataset. However, it does not explicitly provide details about specific training, validation, and test splits (e.g., exact percentages, sample counts, or citations to specific predefined splits) for these datasets in the experimental settings.
Hardware Specification	Yes	All experiments are conducted on 4 NVIDIA A800 GPUs.
Software Dependencies	No	The paper mentions specific LLM models used (e.g., LLaMA2-7B-CHAT, Qwen2.5-7B) but does not provide specific version numbers for software dependencies or libraries like Python, PyTorch, or CUDA.
Experiment Setup	Yes	In the Probe method, for the Truthful QA dataset, we intervene using the top-24 heads, set the threshold range to [0.4, 0.5], the number of backtracking steps to 10, and search for the intervention strength in the range of [40, 80] with a step size of 10. For the six multiple-choice tasks, our threshold search range is [0.3, 0.4, 0.5, 0.6], the intervention strength search range is [0, 250] with a step size of 10, and the number of backtracking steps is 10.
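
The search ranges quoted in the Experiment Setup row can be written out as an explicit grid. The following is a minimal sketch, not the authors' code: the names `SEARCH_SPACES`, `strength_grid`, and `candidate_configs` are hypothetical, the range endpoints are assumed to be inclusive, and the Truthful QA threshold range [0.4, 0.5] is assumed to mean the two candidate values 0.4 and 0.5. The number of heads for the multiple-choice tasks is not stated in the row, so it is left out.

```python
from itertools import product


def strength_grid(low, high, step):
    """Integer intervention strengths from low to high (assumed inclusive)."""
    return list(range(low, high + 1, step))


# Hypothetical encoding of the search ranges quoted in the table row.
SEARCH_SPACES = {
    "truthfulqa": {
        "top_k_heads": 24,                       # "top-24 heads"
        "thresholds": [0.4, 0.5],                # assumed: two candidate values
        "strengths": strength_grid(40, 80, 10),  # 40, 50, ..., 80
        "backtracking_steps": 10,
    },
    "multiple_choice": {
        "thresholds": [0.3, 0.4, 0.5, 0.6],
        "strengths": strength_grid(0, 250, 10),  # 0, 10, ..., 250
        "backtracking_steps": 10,
    },
}


def candidate_configs(task):
    """Enumerate every (threshold, strength) pair for the given task."""
    space = SEARCH_SPACES[task]
    return [
        {
            "threshold": threshold,
            "strength": strength,
            "backtracking_steps": space["backtracking_steps"],
        }
        for threshold, strength in product(space["thresholds"], space["strengths"])
    ]
```

Under these assumptions the grid is small enough to search exhaustively: calling `candidate_configs("truthfulqa")` or `candidate_configs("multiple_choice")` returns the full list of settings to evaluate for that task.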