Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

QA-Calibration of Language Model Confidence Scores

Authors: Putra Manggala, Atalanti A. Mastakouri, Elke Kirschbaum, Shiva Kasiviswanathan, Aaditya Ramdas

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS. Datasets, Models, and Prompts. We use 5 QA datasets: TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), BigBench (Srivastava et al., 2022), OpenBookQA (Mihaylov et al., 2018), and MMLU (Hendrycks et al., 2021) (see Table 4 for more details). We use two performant models: Mistral (Jiang et al., 2023) and Gemma (Team et al., 2024). To elicit confidence scores, we use two prompt techniques recently suggested in the literature: Verb1S-Top1 & Ling1S-Top1 from Tian et al. (2023). See Table 5 (Appendix B.1) for details about the prompts. [...] Results. Table 2 shows the performance of the posthoc calibrators on the MMLU and BigBench datasets. More results are provided in Tables 6 and 7."
Researcher Affiliation | Collaboration | Putra Manggala (University of Amsterdam); Atalanti Mastakouri, Elke Kirschbaum & Shiva Prasad Kasiviswanathan (Amazon); Aaditya Ramdas (Amazon and Carnegie Mellon University)
Pseudocode | Yes | "Algorithm 1: QA binning: Train-time Subroutine... Algorithm 2: QA binning: Test-time Subroutine... Algorithm 3: Scaling QA binning: Train-time Subroutine... Algorithm 4: UMD"
Open Source Code | No | The paper mentions and cites open-source projects such as DistilBERT and XLNet (e.g., "Pretrained models and code are available at https://github.com/zihangdai/xlnet. cite arxiv:1906.08237"), but it does not provide any link or explicit statement open-sourcing the authors' own implementation of QA binning or scaling QA binning.
Open Datasets | Yes | "We use 5 QA datasets: TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), BigBench (Srivastava et al., 2022), OpenBookQA (Mihaylov et al., 2018), and MMLU (Hendrycks et al., 2021) (see Table 4 for more details)."
Dataset Splits | Yes | "Training. We perform a 4-way (20:60:10:10) split of each dataset: the first part is used to construct the kd-tree, the second for posthoc calibration training, the third for hyperparameter tuning, and the fourth for testing."
Hardware Specification | Yes | "Compute Resources. The experiments were run using a 3090Ti GPU and 64 GB of RAM."
Software Dependencies | No | The paper mentions specific models like DistilBERT and XLNet, and programming languages implicitly through the algorithms, but it does not provide version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python) used to implement the methods.
Experiment Setup | Yes | "To set the hyperparameter minimum number of points per bin b (Algorithm 1), we set an ϵ that is not too large as per Figure 2 and use root finding with the ϵ expression in Theorem 3.1 to choose b. We then search over a range of b's by allowing for a misspecification range between 0 and 0.05, and over a range of maximum kd-tree depths depending on the size of the dataset such that each partition admits 3-10 bins. To set B in UMD, we follow the guidelines in Gupta & Ramdas (2021). ... We set the LM temperature close to 0 to minimize output stochasticity and set max tokens so as to be able to process the prompt m(q)."
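The 4-way 20:60:10:10 split quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, seeding, and exact rounding are assumptions:

```python
import numpy as np

def four_way_split(n, fractions=(0.2, 0.6, 0.1, 0.1), seed=0):
    """Randomly partition indices 0..n-1 into four subsets: kd-tree
    construction, posthoc calibration training, hyperparameter
    tuning, and testing (20:60:10:10 by default)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    # cumulative cut points for the first three fractions
    cuts = np.cumsum([int(round(f * n)) for f in fractions[:-1]])
    return np.split(idx, cuts)

tree_idx, calib_idx, tune_idx, test_idx = four_way_split(1000)
print(len(tree_idx), len(calib_idx), len(tune_idx), len(test_idx))
# -> 200 600 100 100
```

Shuffling before cutting keeps the four subsets disjoint and exhaustive, which matters because the kd-tree, the calibrator, and the tuning/test evaluations must not share data.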
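Algorithm 4 (UMD) refers to uniform-mass binning in the style of Gupta & Ramdas (2021): bin edges are placed at empirical quantiles of the calibration scores so each bin holds roughly equal mass, and each bin's output is recalibrated to its mean label. Below is a minimal sketch of that general technique, not the paper's implementation; all names are mine:

```python
import numpy as np

def fit_umd(scores, labels, B):
    """Uniform-mass binning: B-1 interior cut points at empirical
    quantiles of the calibration scores, so bins have ~equal mass;
    each bin is mapped to the mean label of its members."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cuts = np.quantile(scores, np.linspace(0.0, 1.0, B + 1)[1:-1])
    bins = np.searchsorted(cuts, scores, side="right")
    means = np.array([labels[bins == b].mean() if np.any(bins == b) else 0.5
                      for b in range(B)])
    return cuts, means

def predict_umd(cuts, means, scores):
    """Map new confidence scores to their bin's calibrated value."""
    bins = np.searchsorted(cuts, np.asarray(scores, dtype=float), side="right")
    return means[bins]
```

My reading of the paper's QA binning is that a subroutine of this flavor is applied within each kd-tree partition; the sketch shows only the single-partition binning step.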
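The Experiment Setup row quotes a root-finding step: choose the smallest minimum bin size b whose error bound ϵ(b) falls below a target. Theorem 3.1's actual ϵ expression is not reproduced in this report, so the sketch below substitutes a hypothetical Hoeffding-style bound ϵ(b) = sqrt(log(2/δ) / (2b)) purely to illustrate the search; only its monotone-decreasing shape in b matters:

```python
import math

def min_points_per_bin(eps_target, delta=0.05):
    """Smallest integer b with eps(b) <= eps_target, where eps(b) is a
    HYPOTHETICAL Hoeffding-style stand-in for the paper's Theorem 3.1
    expression. Any monotone-decreasing bound can be dropped in."""
    def eps(b):
        return math.sqrt(math.log(2.0 / delta) / (2.0 * b))

    lo, hi = 1, 1
    while eps(hi) > eps_target:   # grow an upper bracket
        hi *= 2
    while lo < hi:                # integer bisection for the crossing
        mid = (lo + hi) // 2
        if eps(mid) <= eps_target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Integer bisection avoids assuming the bound is invertible in closed form, which matches the quoted "root finding with the ϵ expression" phrasing.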