Large Language Models Must Be Taught to Know What They Don’t Know

Authors: Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, Andrew G. Wilson

NeurIPS 2024

Reproducibility assessment. Each item below gives the reproducibility variable, the result, and the supporting LLM response.
Research Type: Experimental. In this work, we first argue that prompting on its own is insufficient to achieve good calibration, and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods, and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not only to their own uncertainties but also to the uncertainties of other models. Lastly, through a user study, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings.
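
As a concrete illustration of the fine-tuning recipe summarized above, the sketch below trains LoRA adapters to grade the model's own answers on a small set of correct and incorrect examples. The prompt template, model name, and single-example loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "meta-llama/Llama-2-7b-hf"  # assumption: any open-source base model
tok = AutoTokenizer.from_pretrained(MODEL)
base = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# LoRA hyperparameters quoted in the Experiment Setup row: r=8, alpha=32, dropout 0.1.
model = get_peft_model(
    base, LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
)

def graded_example_loss(question: str, answer: str, is_correct: bool):
    """Cross-entropy on a Yes/No correctness token for one graded example."""
    prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is the proposed answer correct? "
    )
    target = "Yes" if is_correct else "No"
    ids = tok(prompt + target, return_tensors="pt").input_ids
    labels = ids.clone()
    # Supervise only the trailing Yes/No token(s); mask the context with -100.
    # (Token alignment is approximate here; a real pipeline would align offsets.)
    labels[:, : tok(prompt, return_tensors="pt").input_ids.shape[1]] = -100
    return model(input_ids=ids, labels=labels).loss

loss = graded_example_loss("What is the capital of France?", "Lyon", False)
loss.backward()  # gradients flow only into the LoRA adapter weights
```
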
Researcher Affiliation: Collaboration. Sanyam Kapoor (New York University); Nate Gruver* (New York University); Manley Roberts (Abacus AI); Katherine Collins (Cambridge University); Arka Pal (Abacus AI); Umang Bhatt (New York University); Adrian Weller (Cambridge University); Samuel Dooley (Abacus AI); Micah Goldblum (Columbia University); Andrew Gordon Wilson (New York University).
Pseudocode: No. The paper describes its methodology in text and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes. https://github.com/activatedgeek/calibration-tuning. In the NeurIPS paper checklist (section 5), the authors state: 'Answer: [Yes] Justification: We provide the complete code, and the complete list of datasets used for all experiments in Appendix C.2 to reproduce all our experiments with instructions. All hyperparameters are described in Section 5.'
Open Datasets: Yes. For training, we build a diverse set of samples from a collection of benchmark datasets, similar to instruction-tuning [57]. From the list of 16 benchmark datasets in Appendix C.2, we use a sampled subset of approximately 20,000 examples. Appendix C.2 lists: AI2 Reasoning Challenge (ARC) [12], Boolean Questions (BoolQ) [11], CommonsenseQA [47], CosmosQA [21], HellaSwag [62], MathQA [2], Recognizing Textual Entailment (RTE/SNLI) [8], Adversarial NLI [38], OpenBookQA [36], PIQA [7], SciQ [58], The Commitment Bank (CB) [14], Multi-Sentence Reading Comprehension (MultiRC) [27], Choice of Plausible Alternatives (COPA) [16], TREC [31], and Adversarial Winograd (WinoGrande) [45].
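
For reference, a hedged sketch of assembling such a mixture from the Hugging Face Hub is shown below. The Hub IDs, configs, and equal per-dataset quotas are assumptions (only a few of the 16 benchmarks are shown); the paper's exact sampling procedure is the one described in its Appendix C.2.

```python
from datasets import load_dataset

SOURCES = [  # (hub_id, config) pairs; IDs and configs are assumptions
    ("ai2_arc", "ARC-Challenge"),
    ("boolq", None),
    ("piqa", None),
    ("sciq", None),
    ("winogrande", "winogrande_xl"),
]
QUOTA = 20_000 // 16  # equal per-dataset quotas are an assumption

mixture = []
for hub_id, config in SOURCES:
    ds = load_dataset(hub_id, config, split="train")
    ds = ds.shuffle(seed=0).select(range(min(QUOTA, len(ds))))
    # In practice each dataset's schema must be mapped to a shared QA format.
    mixture.extend(ds)
print(f"{len(mixture)} training examples sampled")
```
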
Dataset Splits: Yes. We hold out 2,000 data points to use as a temperature scaling calibration set [17].
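
Temperature scaling fits a single scalar T on the held-out calibration set by minimizing negative log-likelihood, then divides logits by T at test time. The sketch below is a standard implementation of that procedure, not code from the paper; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: [N, C] held-out calibration logits; labels: [N] class indices."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# At test time: calibrated_probs = F.softmax(test_logits / T, dim=-1)
```
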
Hardware Specification: Yes. Each training run takes approximately 1-3 GPU-days on 4 NVIDIA RTX 8000 (48 GB) GPUs.
Software Dependencies: No. We use Hugging Face Transformers [59] and PyTorch [41] for the implementation of these models. Although the software packages are named, no version numbers are given for Transformers or PyTorch, and these are necessary for fully reproducing the software environment.
Experiment Setup: Yes. For fine-tuning, we use 8-bit quantization and Low-Rank Adapters (LoRA) [20]. For LoRA, we keep the default hyperparameters: rank r = 8, α = 32, and dropout probability 0.1. ... We use the AdamW optimizer [34] with a learning rate of 10^-4, a cosine decay schedule, and effective batch size M = 32. Training runs for G = 10,000 steps with an initial linear warmup schedule for the first 1,000 steps.
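
Put together, the quoted setup corresponds roughly to the configuration sketched below: 8-bit quantization, LoRA with the stated hyperparameters, AdamW at 10^-4 with cosine decay, 1,000 warmup steps, and 10,000 total steps. The model name and scheduler wiring are assumptions, not the authors' exact training script.

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          get_cosine_schedule_with_warmup)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumption: any open-source model
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
base = prepare_model_for_kbit_training(base)

# LoRA defaults quoted above: rank 8, alpha 32, dropout 0.1.
model = get_peft_model(
    base, LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = get_cosine_schedule_with_warmup(
    opt, num_warmup_steps=1_000, num_training_steps=10_000
)
# Effective batch size 32 can come from per-device batches plus gradient
# accumulation across the 4 GPUs; call sched.step() after each optimizer step.
```
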