Multicalibration for Confidence Scoring in LLMs

Authors: Gianluca Detommaso, Martin Andres Bertran, Riccardo Fogliato, Aaron Roth

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive experimental comparison of the methodologies introduced in Sections 2, 3, 4.1 and 4.2. We conduct experiments on a range of question answering datasets, namely Big Bench (Ghazal et al., 2013), MMLU (Hendrycks et al., 2020), Open Book QA (Mihaylov et al., 2018), Truthful QA (Lin et al., 2021), Math QA (Amini et al., 2019), and Trivia QA (Joshi et al., 2017). These datasets enable us to assess the methods across a heterogeneous collection of queries over which the probability of hallucination varies substantially. We assess the outcomes using several state-of-the-art LLMs, namely Stable Beluga-13B (Touvron et al., 2023; Mukherjee et al., 2023), Flan-T5-base (Chung et al., 2022), Bloomz-7b1 (Muennighoff et al., 2022), and Mistral-7B-v0.1 (Jiang et al., 2023). The goal is to provide a comprehensive understanding of how these methods perform across several datasets and LLMs.
Researcher Affiliation | Collaboration | AWS AI; University of Pennsylvania. Correspondence to: Gianluca Detommaso <detommaso.gianluca@gmail.com>.
Pseudocode | Yes | Algorithm 1: Histogram Binning (HB); Algorithm 2: Group-Conditional Unbiased Regression; Algorithm 3: Iterative Grouped Histogram Binning (IGHB); Algorithm 4: Linear Scaling (LS); Algorithm 5: Iterative Grouped Linear Binning (IGLB). (An illustrative sketch of histogram binning and its grouped, iterative variant follows this table.)
Open Source Code | No | The paper does not provide any specific links to open-source code for the methodology described, nor does it explicitly state that the code is publicly available.
Open Datasets | Yes | We conduct experiments on a range of question answering datasets, namely Big Bench (Ghazal et al., 2013), MMLU (Hendrycks et al., 2020), Open Book QA (Mihaylov et al., 2018), Truthful QA (Lin et al., 2021), Math QA (Amini et al., 2019), and Trivia QA (Joshi et al., 2017).
Dataset Splits | Yes | The data is then randomly split into calibration and testing sets, with an 80/20 split. [...] Algorithm 5: Split D into Dcalib and Dval. (A split-and-recalibrate usage example follows this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions tools like UAE-Large-V1 and UMAP with citations, but it does not provide specific version numbers for software dependencies (e.g., Python 3.8, PyTorch 1.9). (An illustrative embedding-and-clustering sketch using these tools follows this table.)
Experiment Setup | No | The paper describes the general approach and algorithms but does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) for training or model configuration.
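
The Pseudocode row names Histogram Binning (Algorithm 1) and Iterative Grouped Histogram Binning (Algorithm 3), but the paper ships no code, so the following is only a minimal sketch of those two ideas. Equal-width bins over [0, 1], groups represented as boolean masks over the calibration set, and a sweep-until-small-update stopping rule are our assumptions, as are all identifiers (`histogram_binning`, `ighb`, `n_bins`, `tol`). For brevity the iterative sketch only recalibrates the calibration scores; a deployable version would record each per-group patch and replay it on test scores.

```python
import numpy as np


def histogram_binning(scores, labels, n_bins=10):
    """Histogram-binning sketch (Algorithm-1 style).

    Maps each raw confidence score to the empirical accuracy of the
    calibration examples that land in the same equal-width bin, and
    returns a function that recalibrates new scores the same way.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])  # bin indices in 0 .. n_bins-1
    bin_means = np.array([
        labels[bin_ids == b].mean() if np.any(bin_ids == b)
        else edges[b:b + 2].mean()              # empty bin: fall back to the bin midpoint
        for b in range(n_bins)
    ])

    def recalibrate(new_scores):
        return bin_means[np.digitize(new_scores, edges[1:-1])]

    return recalibrate


def ighb(scores, labels, groups, n_bins=10, max_iter=50, tol=1e-3):
    """Iterative grouped histogram binning sketch (Algorithm-3 style).

    `groups` is a list of boolean masks over the calibration set.  The
    scores are repeatedly "patched" by histogram binning restricted to
    each group until no patch moves them by more than `tol`.
    """
    f = scores.astype(float).copy()
    for _ in range(max_iter):
        max_update = 0.0
        for g in groups:
            if not np.any(g):
                continue  # skip empty groups
            recal = histogram_binning(f[g], labels[g], n_bins=n_bins)
            patched = recal(f[g])
            max_update = max(max_update, float(np.max(np.abs(patched - f[g]))))
            f[g] = patched
        if max_update < tol:
            break
    return f
```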
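The Dataset Splits row quotes an 80/20 calibration/test split. As a usage illustration only, reusing `histogram_binning` from the sketch above, with toy data and a random seed that are ours rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for model confidences and 0/1 correctness labels.
scores = rng.uniform(size=1_000)
labels = (rng.uniform(size=1_000) < scores).astype(float)

# 80/20 calibration/test split, as described in the quoted passage.
perm = rng.permutation(len(scores))
cal, test = perm[:800], perm[800:]

recalibrate = histogram_binning(scores[cal], labels[cal], n_bins=10)
test_confidences = recalibrate(scores[test])
```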
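The Software Dependencies row mentions UAE-Large-V1 and UMAP without versions. Below is a minimal sketch of one way question embeddings could be reduced and clustered into groups for the grouped methods; the Hugging Face model id `WhereIsAI/UAE-Large-V1`, the choice of `sentence-transformers`, `umap-learn`, and scikit-learn, and the cluster count are assumptions on our part, not specifications from the paper. The returned masks line up index-for-index with the input questions, so they can be fed as `groups` to the `ighb` sketch above.

```python
import umap                       # umap-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def build_groups(questions, n_clusters=20, n_components=5, seed=0):
    """Embed questions, reduce with UMAP, cluster, and return boolean group masks."""
    # Assumed Hugging Face id for the UAE-Large-V1 encoder mentioned in the paper.
    encoder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
    embeddings = encoder.encode(list(questions), normalize_embeddings=True)

    reduced = umap.UMAP(n_components=n_components, random_state=seed).fit_transform(embeddings)
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit_predict(reduced)

    return [cluster_ids == c for c in range(n_clusters)]
```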