Calibrating Reasoning in Language Models with Internal Consistency
Authors: Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical studies across different models and datasets demonstrate that internal consistency effectively distinguishes between correct and incorrect reasoning paths. Motivated by this, we propose a new approach to calibrate reasoning by up-weighting reasoning paths with high internal consistency, resulting in a significant boost in reasoning performance. |
| Researcher Affiliation | Collaboration | Authors: Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li. Affiliations: Shanghai Jiao Tong University; Adobe Research; The University of Hong Kong. Contact: zhxieml@gmail.com, shuaili8@sjtu.edu.cn |
| Pseudocode | No | The paper describes the methods through textual explanations and mathematical equations, such as Equation 2 for Internal Consistency, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Due to policy constraints, we are unable to provide the code with our submission. We provide sufficient implementation details for readers to reproduce our results. |
| Open Datasets | Yes | BoolQ (Clark et al., 2019): A reading comprehension dataset where each instance involves a yes/no question grounded in a related passage. ... Coin Flip (Wei et al., 2022): ... PrOntoQA (Saparov and He, 2023): ... ProofWriter (Tafjord et al., 2020): |
| Dataset Splits | Yes | We split the dataset randomly by 80%/20% into training and validation subsets. |
| Hardware Specification | Yes | We performed all experiments on a compute node with 8 Nvidia GPU cards and 512 GB of memory. |
| Software Dependencies | No | Following Radford et al. (2021), we use the Scikit-learn package (Pedregosa et al., 2011) and determine the L2 regularization strength λ using a hyperparameter sweep over the range between 10⁻⁶ and 10⁶ for logistic regression. |
| Experiment Setup | Yes | In our few-shot CoT experiments, we use Nucleus sampling (Holtzman et al., 2019) with a temperature of 0.7 and a top-p of 0.95 to generate reasoning paths. |
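The calibration idea quoted in the Research Type row — up-weighting reasoning paths with high internal consistency — can be illustrated with a minimal sketch. This is a hypothetical weighted vote, not the paper's Equation 2: the function name `calibrated_vote` and the (answer, score) representation are assumptions for illustration only.

```python
from collections import defaultdict

def calibrated_vote(paths):
    """Aggregate sampled reasoning paths into a final answer.

    Each path is an (answer, consistency_score) pair. Instead of a plain
    majority vote over answers, each vote is weighted by the path's
    internal-consistency score, so high-consistency paths count for more.
    """
    weights = defaultdict(float)
    for answer, score in paths:
        weights[answer] += score
    return max(weights, key=weights.get)

# Illustrative: the single high-consistency "yes" path (0.9) outweighs
# two low-consistency "no" paths (0.3 + 0.4 = 0.7).
paths = [("yes", 0.9), ("no", 0.3), ("no", 0.4)]
print(calibrated_vote(paths))  # -> yes
```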
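The 80%/20% random split quoted in the Dataset Splits row can be reproduced with a few lines of standard-library Python; the seed here is arbitrary, since the paper does not report one.

```python
import random

def split_80_20(examples, seed=0):
    """Randomly shuffle and split a dataset 80%/20% into train/validation,
    matching the ratio reported in the paper (exact seed unknown)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(0.8 * len(examples))
    train = [examples[i] for i in idx[:cut]]
    val = [examples[i] for i in idx[cut:]]
    return train, val

data = list(range(100))
train, val = split_80_20(data)
print(len(train), len(val))  # -> 80 20
```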
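The Software Dependencies row describes a λ sweep over [10⁻⁶, 10⁶] for a scikit-learn logistic regression probe. A sketch of that protocol, with assumptions: the grid spacing, the synthetic data, and validation-accuracy scoring are illustrative, since the paper states only the sweep range. Note that scikit-learn parameterizes L2 regularization as C = 1/λ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Grid of L2 strengths spanning the reported range (spacing is assumed).
lambdas = np.logspace(-6, 6, num=13)

# Tiny synthetic probing task, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
X_tr, y_tr, X_va, y_va = X[:160], y[:160], X[160:], y[160:]

best_lam, best_acc = None, -1.0
for lam in lambdas:
    # scikit-learn's C is the inverse of the regularization strength lambda.
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)
    acc = clf.score(X_va, y_va)
    if acc > best_acc:
        best_lam, best_acc = lam, acc
print(best_lam)
```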
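The Experiment Setup row specifies nucleus (top-p) sampling with temperature 0.7 and top-p 0.95. A minimal pure-NumPy sketch of that decoding rule — a reimplementation for clarity, not the paper's actual generation stack:

```python
import numpy as np

def nucleus_sample(logits, temperature=0.7, top_p=0.95, rng=None):
    """Top-p (nucleus) sampling (Holtzman et al., 2019): keep the smallest
    set of tokens whose cumulative probability reaches top_p, renormalize,
    and sample from that set."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # smallest prefix with mass >= top_p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()   # renormalize over the nucleus
    return rng.choice(keep, p=kept)

# With a sharply peaked distribution, the nucleus contains only the top token.
logits = np.array([10.0, 0.0, 0.0])
print(nucleus_sample(logits))  # -> 0
```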