Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

QA-Calibration of Language Model Confidence Scores

Authors: Putra Manggala, Atalanti A. Mastakouri, Elke Kirschbaum, Shiva Kasiviswanathan, Aaditya Ramdas

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS. Datasets, Models, and Prompts. We use 5 QA datasets: TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), BigBench (Srivastava et al., 2022), OpenBookQA (Mihaylov et al., 2018), and MMLU (Hendrycks et al., 2021) (see Table 4 for more details). We use two performant models: Mistral (Jiang et al., 2023) and Gemma (Team et al., 2024). To elicit confidence scores, we use two prompt techniques recently suggested in the literature: Verb1S-Top1 & Ling1S-Top1 from Tian et al. (2023). See Table 5 (Appendix B.1) for details about the prompts. [...] Results. Table 2 shows the performance of the posthoc calibrators on the MMLU and BigBench datasets. More results are provided in Tables 6 and 7."
Researcher Affiliation | Collaboration | Putra Manggala (University of Amsterdam); Atalanti Mastakouri, Elke Kirschbaum & Shiva Prasad Kasiviswanathan (Amazon); Aaditya Ramdas (Amazon and Carnegie Mellon University)
Pseudocode | Yes | "Algorithm 1: QA binning: Train-time Subroutine... Algorithm 2: QA binning: Test-time Subroutine... Algorithm 3: Scaling QA binning: Train-time Subroutine... Algorithm 4: UMD"
Open Source Code | No | The paper mentions and cites open-source projects such as DistilBERT and XLNet (e.g., "Pretrained models and code are available at https://github.com/zihangdai/xlnet. cite arxiv:1906.08237"), but it does not provide any link or explicit statement open-sourcing the authors' own implementation of QA binning or scaling QA binning.
Open Datasets | Yes | "We use 5 QA datasets: TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), BigBench (Srivastava et al., 2022), OpenBookQA (Mihaylov et al., 2018), and MMLU (Hendrycks et al., 2021) (see Table 4 for more details)."
Dataset Splits | Yes | "Training. We perform a 4-way (20:60:10:10) split of each dataset: the first part is used to construct the kd-tree, the second for posthoc calibration training, the third for hyperparameter tuning, and the fourth for testing."
Hardware Specification | Yes | "Compute Resources. The experiments were run using a 3090Ti GPU and 64 GB of RAM."
Software Dependencies | No | The paper mentions specific models like DistilBERT and XLNet, and programming languages implicitly through the algorithms, but it does not provide version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python) used to implement the methods.
Experiment Setup | Yes | "To set the hyperparameter minimum number of points per bin b (Algorithm 1), we set an ϵ that is not too large as per Figure 2 and use root finding with the ϵ expression in Theorem 3.1 to choose b. We then search over a range of b's by allowing for a misspecification range between 0 and 0.05, and over a range of maximum kd-tree depths depending on the size of the dataset such that each partition admits 3-10 bins. To set B in UMD, we follow the guidelines in Gupta & Ramdas (2021). ... We set the LM temperature close to 0 to minimize output stochasticity and set max tokens so as to be able to process the prompt m(q)."
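The 4-way 20:60:10:10 split quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, seeding, and exact rounding are assumptions:

```python
import numpy as np

def four_way_split(n, fractions=(0.2, 0.6, 0.1, 0.1), seed=0):
    """Randomly partition indices 0..n-1 into four subsets: kd-tree
    construction, posthoc calibration training, hyperparameter
    tuning, and testing (20:60:10:10 by default)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    # cumulative cut points for the first three fractions
    cuts = np.cumsum([int(round(f * n)) for f in fractions[:-1]])
    return np.split(idx, cuts)

tree_idx, calib_idx, tune_idx, test_idx = four_way_split(1000)
print(len(tree_idx), len(calib_idx), len(tune_idx), len(test_idx))
# -> 200 600 100 100
```

Shuffling before cutting keeps the four subsets disjoint and exhaustive, which matters because the kd-tree, the calibrator, and the tuning/test evaluations must not share data.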
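Algorithm 4 (UMD) refers to uniform-mass binning in the style of Gupta & Ramdas (2021): bin edges are placed at empirical quantiles of the calibration scores so each bin holds roughly equal mass, and each bin's output is recalibrated to its mean label. Below is a minimal sketch of that general technique, not the paper's implementation; all names are mine:

```python
import numpy as np

def fit_umd(scores, labels, B):
    """Uniform-mass binning: B-1 interior cut points at empirical
    quantiles of the calibration scores, so bins have ~equal mass;
    each bin is mapped to the mean label of its members."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cuts = np.quantile(scores, np.linspace(0.0, 1.0, B + 1)[1:-1])
    bins = np.searchsorted(cuts, scores, side="right")
    means = np.array([labels[bins == b].mean() if np.any(bins == b) else 0.5
                      for b in range(B)])
    return cuts, means

def predict_umd(cuts, means, scores):
    """Map new confidence scores to their bin's calibrated value."""
    bins = np.searchsorted(cuts, np.asarray(scores, dtype=float), side="right")
    return means[bins]
```

My reading of the paper's QA binning is that a subroutine of this flavor is applied within each kd-tree partition; the sketch shows only the single-partition binning step.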
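The Experiment Setup row quotes a root-finding step: choose the smallest minimum bin size b whose error bound ϵ(b) falls below a target. Theorem 3.1's actual ϵ expression is not reproduced in this report, so the sketch below substitutes a hypothetical Hoeffding-style bound ϵ(b) = sqrt(log(2/δ) / (2b)) purely to illustrate the search; only its monotone-decreasing shape in b matters:

```python
import math

def min_points_per_bin(eps_target, delta=0.05):
    """Smallest integer b with eps(b) <= eps_target, where eps(b) is a
    HYPOTHETICAL Hoeffding-style stand-in for the paper's Theorem 3.1
    expression. Any monotone-decreasing bound can be dropped in."""
    def eps(b):
        return math.sqrt(math.log(2.0 / delta) / (2.0 * b))

    lo, hi = 1, 1
    while eps(hi) > eps_target:   # grow an upper bracket
        hi *= 2
    while lo < hi:                # integer bisection for the crossing
        mid = (lo + hi) // 2
        if eps(mid) <= eps_target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Integer bisection avoids assuming the bound is invertible in closed form, which matches the quoted "root finding with the ϵ expression" phrasing.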