reproducibilityindex.ai

Calibrating Reasoning in Language Models with Internal Consistency

Authors: Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive empirical studies across different models and datasets demonstrate that internal consistency effectively distinguishes between correct and incorrect reasoning paths. Motivated by this, we propose a new approach to calibrate reasoning by up-weighting reasoning paths with high internal consistency, resulting in a significant boost in reasoning performance.
Researcher Affiliation	Collaboration	Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li; Shanghai Jiao Tong University, Adobe Research, The University of Hong Kong; zhxieml@gmail.com, shuaili8@sjtu.edu.cn
Pseudocode	No	The paper describes the methods through textual explanations and mathematical equations, such as Equation 2 for Internal Consistency, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Due to policy constraints, we are unable to provide the code with our submission. We provide sufficient implementation details for readers to reproduce our results.
Open Datasets	Yes	BoolQ (Clark et al., 2019): A reading comprehension dataset where each instance involves a yes/no question grounded in a related passage. ... Coin Flip (Wei et al., 2022): ... PrOntoQA (Saparov and He, 2023): ... ProofWriter (Tafjord et al., 2020):
Dataset Splits	Yes	We split the dataset randomly by 80%/20% into training and validation subsets.
Hardware Specification	Yes	We performed all experiments on a compute node with 8 Nvidia GPU cards and 512 GB of memory.
Software Dependencies	No	Following Radford et al. (2021), we use the Scikit-learn package (Pedregosa et al., 2011) and determine the L2 regularization strength λ using a hyperparameter sweep over the range between 10^-6 and 10^6 for logistic regression.
Experiment Setup	Yes	In our few-shot CoT experiments, we use Nucleus sampling (Holtzman et al., 2019) with a temperature of 0.7 and a top-p of 0.95 to generate reasoning paths.
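
The decoding settings quoted in the Experiment Setup row (temperature 0.7, top-p 0.95) can be illustrated with a minimal standard-library sketch of nucleus sampling. This is not the authors' code, which was not released. The function names `nucleus_filter` and `nucleus_sample` and the example logits are hypothetical, and a real experiment would apply the same settings through a model library's generation options rather than a hand-written sampler.

```python
import math
import random


def nucleus_filter(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} for the top-p nucleus (illustrative helper)."""
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(logits, temperature=0.7, top_p=0.95, rng=None):
    """Draw one token index from the nucleus distribution (illustrative helper)."""
    rng = rng or random.Random()
    dist = nucleus_filter(logits, temperature, top_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


if __name__ == "__main__":
    example_logits = [3.0, 1.0, 0.0, -5.0]  # made-up values for illustration
    print(nucleus_filter(example_logits))
    print(nucleus_sample(example_logits, rng=random.Random(0)))
```

With these made-up logits, the two lowest-scoring tokens fall outside the 0.95 nucleus, so they can never be sampled. Lowering the temperature sharpens the distribution further and shrinks the nucleus.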