Calibrating Reasoning in Language Models with Internal Consistency

Authors: Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li

NeurIPS 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental. "Extensive empirical studies across different models and datasets demonstrate that internal consistency effectively distinguishes between correct and incorrect reasoning paths. Motivated by this, we propose a new approach to calibrate reasoning by up-weighting reasoning paths with high internal consistency, resulting in a significant boost in reasoning performance."
Researcher Affiliation: Collaboration. Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li (Shanghai Jiao Tong University; Adobe Research; The University of Hong Kong). Contact: zhxieml@gmail.com, shuaili8@sjtu.edu.cn.
Pseudocode: No. "The paper describes the methods through textual explanations and mathematical equations, such as Equation 2 for Internal Consistency, but does not include any explicitly labeled pseudocode or algorithm blocks."
Open Source Code: No. "Due to policy constraints, we are unable to provide the code with our submission. We provide sufficient implementation details for readers to reproduce our results."
Open Datasets: Yes. "BoolQ (Clark et al., 2019): A reading comprehension dataset where each instance involves a yes/no question grounded in a related passage. ... Coin Flip (Wei et al., 2022): ... PrOntoQA (Saparov and He, 2023): ... ProofWriter (Tafjord et al., 2020): ..."
Dataset Splits: Yes. "We split the dataset randomly by 80%/20% into training and validation subsets."
Hardware Specification: Yes. "We performed all experiments on a compute node with 8 Nvidia GPU cards and 512 GB of memory."
Software Dependencies: No. "Following Radford et al. (2021), we use the Scikit-learn package (Pedregosa et al., 2011) and determine the L2 regularization strength λ using a hyperparameter sweep over the range between 10^-6 and 10^6 for logistic regression."
Experiment Setup: Yes. "In our few-shot CoT experiments, we use nucleus sampling (Holtzman et al., 2019) with a temperature of 0.7 and a top-p of 0.95 to generate reasoning paths."
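The 80%/20% random split reported under Dataset Splits can be sketched as below; the helper name and the fixed seed are illustrative assumptions, not details from the paper:

```python
import random

def split_dataset(examples, train_frac=0.8, seed=0):
    """Randomly split examples into train/validation subsets (80%/20%)."""
    rng = random.Random(seed)          # fixed seed only for reproducibility of the sketch
    shuffled = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, val = split_dataset(list(range(100)))
```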
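The λ sweep over 10^-6 to 10^6 mentioned under Software Dependencies might look like the following. The paper uses Scikit-learn's logistic regression; this plain-NumPy gradient-descent stand-in and the synthetic data are assumptions made to keep the sketch self-contained:

```python
import numpy as np

def fit_logreg(X, y, lam, lr=0.1, steps=500):
    # L2-regularized logistic regression fit by plain gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))  # clip to avoid overflow
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

def sweep_lambda(X_tr, y_tr, X_val, y_val, lams):
    # Pick the regularization strength with the best validation accuracy
    best_lam, best_acc = None, -1.0
    for lam in lams:
        w = fit_logreg(X_tr, y_tr, lam)
        if not np.all(np.isfinite(w)):
            continue  # skip fits that diverged at extreme lambda
        acc = float(np.mean((X_val @ w > 0) == y_val))
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) > 0).astype(float)
best_lam, best_acc = sweep_lambda(X[:160], y[:160], X[160:], y[160:],
                                  np.logspace(-6, 6, 13))
```

Note that Scikit-learn parameterizes the penalty inversely (its `C` is 1/λ), so a sweep there would pass `C` values over the mirrored range.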
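The decoding configuration under Experiment Setup (nucleus sampling, temperature 0.7, top-p 0.95) corresponds to the single-step sampler sketched below. In practice these values are passed to a generation library; this NumPy reimplementation is only an illustration of the technique:

```python
import numpy as np

def nucleus_sample(logits, temperature=0.7, top_p=0.95, rng=None):
    # Temperature-scaled softmax followed by top-p (nucleus) truncation
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())                  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable first
    cum = np.cumsum(probs[order])
    # keep the smallest prefix whose cumulative mass reaches top_p
    keep = order[: np.searchsorted(cum, top_p) + 1]
    kept = probs[keep] / probs[keep].sum()       # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```

A dominant logit collapses the nucleus to one token, while flat logits leave every token eligible, which is what makes the 0.95 cutoff adaptive per step.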