Thermometer: Towards Universal Calibration for Large Language Models

Authors: Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory W. Wornell, Soumya Ghosh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method. We empirically evaluate THERMOMETER on diverse benchmarks and models and find it consistently produces better-calibrated uncertainties than competing methods at a fraction of the computational cost.
Researcher Affiliation | Collaboration | 1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA; 2 MIT-IBM Watson AI Lab, IBM Research.
Pseudocode | Yes | Algorithm 1 summarizes our learning procedure.
Open Source Code | Yes | The code is available at https://github.com/maohaos2/Thermometer.
Open Datasets | Yes | We employ two widely used benchmark datasets for multiple-choice question-and-answer (QA) experiments: MMLU (Hendrycks et al., 2020) and BIG-bench (Srivastava et al., 2022), and adopt MRQA (Fisch et al., 2019) for experiments on QA with free-form answers.
Dataset Splits | Yes | We train the model using K−1 datasets, and test the trained model on the remaining single testing task, repeating this process for all the K datasets. For the QA tasks with free-form answers, we use the established train and dev splits. We train THERMOMETER on MRQA's train split, and evaluate on the six held-out development datasets.
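The leave-one-task-out protocol quoted above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the task names below are hypothetical placeholders, not the paper's actual benchmark list.

```python
def leave_one_task_out_splits(tasks):
    """Yield (train_tasks, held_out_task) pairs, one per held-out task.

    Mirrors the protocol described in the paper: train on K-1 tasks,
    evaluate on the remaining one, and rotate through all K tasks.
    """
    for i, held_out in enumerate(tasks):
        train = tasks[:i] + tasks[i + 1:]
        yield train, held_out


# Placeholder task names for illustration (not the paper's exact list).
tasks = ["abstract_algebra", "anatomy", "astronomy", "business_ethics"]
splits = list(leave_one_task_out_splits(tasks))
# Each of the K tasks is held out exactly once; the other K-1 form the
# training set for that round.
```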
Hardware Specification | Yes | All experiments are implemented in PyTorch using a Tesla V100 GPU with 32 GB memory and a Tesla A100 GPU with 40 GB memory.
Software Dependencies | No | The paper mentions "implemented in PyTorch" but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | The input dimension of THERMOMETER is set to 2048 and 4096 for FLAN-T5-XL and LLaMA-2-Chat 7B, respectively. Correspondingly, the dimensions of the hidden layers in THERMOMETER are set at 256 for FLAN-T5-XL and 512 for LLaMA-2-Chat 7B. To ensure that the output of THERMOMETER remains positive, a Softplus activation function is adopted. The optimization of THERMOMETER utilizes the AdamW optimizer, and all the hyper-parameters used for training are summarized in Table 16. Table 16: Batch Size (Nb) = 128; Epochs (M) = 5000; Checkpoint (m) = 50; lr (γ) = 1e-3; Weight Decay = 1e-4; λreg = 1e-2; Prior α0 = 1.25; Prior β0 = 4.0.
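The setup above describes a network whose positive output (via Softplus) serves as a temperature for scaling model logits. As a minimal, dependency-free sketch of the underlying temperature-scaling mechanics (not the paper's actual Thermometer model, which is a learned MLP over LLM features), the effect of a predicted temperature on a softmax distribution looks like this:

```python
import math


def softplus(x):
    """Softplus keeps a predicted temperature strictly positive,
    as the setup above requires of THERMOMETER's output."""
    return math.log1p(math.exp(x))


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T > 1 flattens the distribution
    (less confident), T < 1 sharpens it (more confident)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for a 3-way multiple-choice answer.
logits = [2.0, 1.0, 0.5]
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
# A higher temperature yields a less peaked distribution, which is how
# temperature scaling tempers an overconfident model.
```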