Thermometer: Towards Universal Calibration for Large Language Models
Authors: Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory W. Wornell, Soumya Ghosh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method. We empirically evaluate THERMOMETER on diverse benchmarks and models, and find it consistently produces better-calibrated uncertainties than competing methods at a fraction of the computational cost. |
| Researcher Affiliation | Collaboration | ¹Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA; ²MIT-IBM Watson AI Lab, IBM Research. |
| Pseudocode | Yes | Algorithm 1 summarizes our learning procedure. |
| Open Source Code | Yes | The code is available at https://github.com/maohaos2/Thermometer. |
| Open Datasets | Yes | We employ two widely used benchmark datasets for multiple-choice question-and-answer (QA) experiments: MMLU (Hendrycks et al., 2020) and BIG-bench (Srivastava et al., 2022), and adopt MRQA (Fisch et al., 2019) for experiments on QA with free-form answers. |
| Dataset Splits | Yes | We train the model using K − 1 datasets, and test the trained model on the remaining single testing task, repeating this process for all the K datasets (see the first sketch after the table). For the QA tasks with free-form answers, we use the established train and dev splits. We train THERMOMETER on MRQA's train split, and evaluate on the six held-out development datasets. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch using a Tesla V100 GPU with 32 GB memory and a Tesla A100 GPU with 40 GB memory. |
| Software Dependencies | No | The paper mentions "implemented in PyTorch" but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The input dimension of THERMOMETER is set to 2048 and 4096 for FLAN-T5-XL and LLaMA-2-Chat-7B, respectively. Correspondingly, the dimensions of the hidden layers in THERMOMETER are set at 256 for FLAN-T5-XL and 512 for LLaMA-2-Chat-7B. To ensure that the output of THERMOMETER remains positive, a Softplus activation function is adopted. The optimization of THERMOMETER uses the AdamW optimizer, and all the hyper-parameters used for training are summarized in Table 16: Batch Size (N_b) = 128; Epochs (M) = 5000; Checkpoint (m) = 50; lr (γ) = 1e-3; Weight Decay = 1e-4; λ_reg = 1e-2; Prior α₀ = 1.25; Prior β₀ = 4.0. A PyTorch sketch of this configuration appears in the second sketch after the table. |
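The leave-one-task-out protocol quoted in the Dataset Splits row is easy to misread, so here is a minimal sketch of the loop it describes. The helper names `train_thermometer` and `evaluate_calibration`, the placeholder task list, and the returned metric are illustrative assumptions, not code from the authors' repository.

```python
# Sketch of the leave-one-task-out protocol from the Dataset Splits row.
# `train_thermometer` and `evaluate_calibration` are hypothetical stand-ins
# for the actual routines in the authors' repository.

def train_thermometer(train_tasks):
    """Placeholder: fit THERMOMETER on the K - 1 training tasks."""
    return {"trained_on": tuple(train_tasks)}

def evaluate_calibration(model, task):
    """Placeholder: return a calibration metric (e.g. ECE) on the held-out task."""
    return 0.0

tasks = [f"task_{k}" for k in range(5)]  # the K benchmark datasets (e.g. MMLU subjects)

results = {}
for held_out in tasks:
    train_tasks = [t for t in tasks if t != held_out]  # train on the other K - 1 tasks
    model = train_thermometer(train_tasks)
    results[held_out] = evaluate_calibration(model, held_out)  # test on the held-out task
```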
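The Experiment Setup row fixes the input/hidden dimensions, the Softplus output, the AdamW optimizer, and the learning-rate/weight-decay values, which is enough to sketch a plausible PyTorch module. Only those quoted details come from the paper; the number of hidden layers and the ReLU nonlinearity here are assumptions.

```python
import torch
import torch.nn as nn

class Thermometer(nn.Module):
    """Sketch of the THERMOMETER temperature-prediction head.

    Maps an LLM's hidden features to a positive temperature. Dimensions follow
    the paper (2048/256 for FLAN-T5-XL, 4096/512 for LLaMA-2-Chat-7B); the
    two-hidden-layer depth and ReLU activations are assumptions.
    """

    def __init__(self, input_dim=4096, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),  # keeps the predicted temperature positive, as in the paper
        )

    def forward(self, features):
        return self.net(features)

# LLaMA-2-Chat-7B settings; optimizer values match Table 16 (lr 1e-3, weight decay 1e-4).
model = Thermometer(input_dim=4096, hidden_dim=512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```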