Multicalibration for Confidence Scoring in LLMs

Authors: Gianluca Detommaso, Martin Andres Bertran, Riccardo Fogliato, Aaron Roth

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive experimental comparison of the methodologies introduced in Sections 2, 3, 4.1 and 4.2. We conduct experiments on a range of question answering datasets, namely Big Bench (Ghazal et al., 2013), MMLU (Hendrycks et al., 2020), Open Book QA (Mihaylov et al., 2018), Truthful QA (Lin et al., 2021), Math QA (Amini et al., 2019), and Trivia QA (Joshi et al., 2017). These datasets enable us to assess the methods across a heterogeneous collection of queries over which the probability of hallucination varies substantially. We assess the outcomes using several state-of-the-art LLMs, namely Stable Beluga-13B (Touvron et al., 2023; Mukherjee et al., 2023), Flan-T5-base (Chung et al., 2022), Bloomz-7b1 (Muennighoff et al., 2022), and Mistral-7B-v0.1 (Jiang et al., 2023). The goal is to provide a comprehensive understanding of how these methods perform across several datasets and LLMs.
Researcher Affiliation | Collaboration | AWS AI; University of Pennsylvania. Correspondence to: Gianluca Detommaso <detommaso.gianluca@gmail.com>.
Pseudocode | Yes | Algorithm 1: Histogram Binning (HB); Algorithm 2: Group-Conditional Unbiased Regression; Algorithm 3: Iterative Grouped Histogram Binning (IGHB); Algorithm 4: Linear Scaling (LS); Algorithm 5: Iterative Grouped Linear Binning (IGLB). (An illustrative sketch of histogram binning and its grouped, iterative variant follows this table.)
Open Source Code | No | The paper does not provide any specific links to open-source code for the methodology described, nor does it explicitly state that the code is publicly available.
Open Datasets | Yes | We conduct experiments on a range of question answering datasets, namely Big Bench (Ghazal et al., 2013), MMLU (Hendrycks et al., 2020), Open Book QA (Mihaylov et al., 2018), Truthful QA (Lin et al., 2021), Math QA (Amini et al., 2019), and Trivia QA (Joshi et al., 2017).
Dataset Splits | Yes | The data is then randomly split into calibration and testing sets, with an 80/20 split. [...] Algorithm 5: Split D into Dcalib and Dval. (A split-and-recalibrate usage example follows this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions tools like UAE-Large-V1 and UMAP with citations, but it does not provide specific version numbers for software dependencies (e.g., Python 3.8, PyTorch 1.9). (An illustrative embedding-and-clustering sketch using these tools follows this table.)
Experiment Setup | No | The paper describes the general approach and algorithms but does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) for training or model configuration.
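
The Pseudocode row names Histogram Binning (Algorithm 1) and Iterative Grouped Histogram Binning (Algorithm 3), but the paper ships no code, so the following is only a minimal sketch of those two ideas. Equal-width bins over [0, 1], groups represented as boolean masks over the calibration set, and a sweep-until-small-update stopping rule are our assumptions, as are all identifiers (`histogram_binning`, `ighb`, `n_bins`, `tol`). For brevity the iterative sketch only recalibrates the calibration scores; a deployable version would record each per-group patch and replay it on test scores.

```python
import numpy as np


def histogram_binning(scores, labels, n_bins=10):
    """Histogram-binning sketch (Algorithm-1 style).

    Maps each raw confidence score to the empirical accuracy of the
    calibration examples that land in the same equal-width bin, and
    returns a function that recalibrates new scores the same way.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])  # bin indices in 0 .. n_bins-1
    bin_means = np.array([
        labels[bin_ids == b].mean() if np.any(bin_ids == b)
        else edges[b:b + 2].mean()              # empty bin: fall back to the bin midpoint
        for b in range(n_bins)
    ])

    def recalibrate(new_scores):
        return bin_means[np.digitize(new_scores, edges[1:-1])]

    return recalibrate


def ighb(scores, labels, groups, n_bins=10, max_iter=50, tol=1e-3):
    """Iterative grouped histogram binning sketch (Algorithm-3 style).

    `groups` is a list of boolean masks over the calibration set.  The
    scores are repeatedly "patched" by histogram binning restricted to
    each group until no patch moves them by more than `tol`.
    """
    f = scores.astype(float).copy()
    for _ in range(max_iter):
        max_update = 0.0
        for g in groups:
            if not np.any(g):
                continue  # skip empty groups
            recal = histogram_binning(f[g], labels[g], n_bins=n_bins)
            patched = recal(f[g])
            max_update = max(max_update, float(np.max(np.abs(patched - f[g]))))
            f[g] = patched
        if max_update < tol:
            break
    return f
```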
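The Dataset Splits row quotes an 80/20 calibration/test split. As a usage illustration only, reusing `histogram_binning` from the sketch above, with toy data and a random seed that are ours rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for model confidences and 0/1 correctness labels.
scores = rng.uniform(size=1_000)
labels = (rng.uniform(size=1_000) < scores).astype(float)

# 80/20 calibration/test split, as described in the quoted passage.
perm = rng.permutation(len(scores))
cal, test = perm[:800], perm[800:]

recalibrate = histogram_binning(scores[cal], labels[cal], n_bins=10)
test_confidences = recalibrate(scores[test])
```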
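The Software Dependencies row mentions UAE-Large-V1 and UMAP without versions. Below is a minimal sketch of one way question embeddings could be reduced and clustered into groups for the grouped methods; the Hugging Face model id `WhereIsAI/UAE-Large-V1`, the choice of `sentence-transformers`, `umap-learn`, and scikit-learn, and the cluster count are assumptions on our part, not specifications from the paper. The returned masks line up index-for-index with the input questions, so they can be fed as `groups` to the `ighb` sketch above.

```python
import umap                       # umap-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def build_groups(questions, n_clusters=20, n_components=5, seed=0):
    """Embed questions, reduce with UMAP, cluster, and return boolean group masks."""
    # Assumed Hugging Face id for the UAE-Large-V1 encoder mentioned in the paper.
    encoder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
    embeddings = encoder.encode(list(questions), normalize_embeddings=True)

    reduced = umap.UMAP(n_components=n_components, random_state=seed).fit_transform(embeddings)
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit_predict(reduced)

    return [cluster_ids == c for c in range(n_clusters)]
```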