Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SteerConf: Steering LLMs for Confidence Elicitation

Authors: Ziang Zhou, Tianyuan Jin, Jieming Shi, Qing Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on seven benchmarks spanning professional knowledge, common sense, ethics, and reasoning tasks, using advanced LLM models (GPT-3.5, LLaMA 3, GPT-4), demonstrate that SteerConf significantly outperforms existing methods, often by a significant margin.
Researcher Affiliation	Academia	Ziang Zhou Department of Computing The Hong Kong Polytechnic University Hong Kong, China EMAIL Tianyuan Jin Department of Electrical and Computer Engineering National University of Singapore Singapore EMAIL Jieming Shi Department of Computing The Hong Kong Polytechnic University Hong Kong, China EMAIL Qing Li Department of Computing The Hong Kong Polytechnic University Hong Kong, China EMAIL
Pseudocode	No	The paper describes the framework components with mathematical formulas and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 3 outlines 'Steering Prompts', 'Steered Confidence Consistency', and 'Steered Confidence Calibration'.
Open Source Code	Yes	The implementation is at https://github.com/scottjiao/SteerConf.
Open Datasets	Yes	Datasets. We assess confidence estimation quality across five categories of reasoning tasks: (1) Commonsense Reasoning using Sports Understanding dataset (Sport) [18] and StrategyQA (StrategyQA) [7] from BigBench [8]; (2) Arithmetic Reasoning evaluated on GSM8K (GSM8K) [3]; (3) Symbolic Reasoning covering Date Understanding (DateUnd) [36] and Object Counting (ObjCnt) [33]; (4) Professional Knowledge tested through Law (Law) from MMLU [12]; and (5) Ethical Knowledge examined via Business Ethics (Ethics) in MMLU [12].
Dataset Splits	No	The paper mentions using several datasets for evaluation (e.g., Sports Understanding, StrategyQA, GSM8K, etc.) but does not specify how these datasets were split into training, validation, or test sets for the experiments.
Hardware Specification	No	Most experiments are from calling LLM APIs. ... Note that experiments with GPT-4 incurred a cost of approximately 1500 USD due to its higher pricing.
Software Dependencies	No	The paper mentions the use of LLM models like GPT-3.5, LLaMA 3, and GPT-4 but does not specify any particular software libraries, frameworks, or their version numbers used for implementing the methodology.
Experiment Setup	Yes	In our setting, we set ℓ = 2, which means we have five steering levels: {very cautious, cautious, vanilla, confident, very confident}, a moderate granularity, which is sufficient to demonstrate the effectiveness of our method. ... For Misleading and Self-Random, we use M = 5 samples; for Top-K, we set K = 5 answer-confidence pairs. ... We provide the detailed prompts used in our experiments for LLMs under both CoT and non-CoT settings.
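
The setup values quoted in the Experiment Setup row can be collected into a small configuration, sketched below. This is only an illustration: every identifier (`SETUP`, `LEVEL_NAMES`, `steering_levels`) is hypothetical and not taken from the SteerConf repository. The signed indexing from -ℓ to ℓ is also an assumption, inferred from ℓ = 2 yielding five levels.

```python
# Hypothetical sketch: names and signed indexing are assumptions,
# not the SteerConf authors' implementation.

# Values quoted in the Experiment Setup row.
SETUP = {
    "ell": 2,                   # steering granularity (ℓ = 2)
    "misleading_samples": 5,    # M = 5 for the Misleading baseline
    "self_random_samples": 5,   # M = 5 for the Self-Random baseline
    "top_k": 5,                 # K = 5 answer-confidence pairs for Top-K
}

# Level names as listed in the row, keyed by an assumed signed index.
LEVEL_NAMES = {
    -2: "very cautious",
    -1: "cautious",
    0: "vanilla",
    1: "confident",
    2: "very confident",
}


def steering_levels(ell: int) -> list[int]:
    """Return the signed levels -ell..ell, i.e. 2 * ell + 1 levels in total."""
    return list(range(-ell, ell + 1))


levels = steering_levels(SETUP["ell"])
names = [LEVEL_NAMES[k] for k in levels]
print(names)
```

Under this assumed indexing, ℓ = 2 gives the five levels named in the row, with "vanilla" as the unsteered midpoint.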