Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

Authors: Kyle Cox, Jiawei Xu, Yikun Han, Rong Xu, Tianhao Li, Chi-Yang Hsu, Tianlong Chen, Walter Gerych, Ying Ding

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
LLM Response: "Let Q(Y | x, θ) be the response distribution of the language model at an input x and parameterization θ. ... We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023)."

Researcher Affiliation: Academia
LLM Response: "Kyle Cox (1), Jiawei Xu (1), Yikun Han (2), Rong Xu (1), Tianhao Li (1), Chi-Yang Hsu (1), Tianlong Chen (3), Walter Gerych (4), Ying Ding (1). (1) University of Texas at Austin, (2) University of Michigan-Ann Arbor, (3) University of North Carolina at Chapel Hill, (4) MIT. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL"

Pseudocode: No
LLM Response: The paper describes methods and formulas using mathematical notation and descriptive text, but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks or sections.

Open Source Code: Yes
LLM Response: "Code: https://github.com/xocelyk/paraphrase-uncertainty"

Open Datasets: Yes
LLM Response: "Datasets. We use two question-answering datasets: Trivia QA (Joshi et al. 2017) and Natural Questions (NQ) (Kwiatkowski et al. 2019)."

Dataset Splits: Yes
LLM Response: "Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset."

Hardware Specification: Yes
LLM Response: "The experiments were conducted on eight RTX A6000 GPUs."

Software Dependencies: No
LLM Response: The paper names the specific LLMs used (GPT-3.5 (gpt-3.5-turbo-0125), Llama 2-Base (7B), and Llama 2-Chat (7B)) but does not give version numbers for ancillary software dependencies such as programming languages or libraries.

Experiment Setup: Yes
LLM Response: "Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset. The number of perturbations (np) and the number of samples (ns) were experimented with in different pairs. We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023). For GPT-3.5, we sample with temperature 1. For Llama models, we sample with temperature 0.6."
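The experiment-setup row describes a repeated-runs protocol: sample responses for each prompt, compute an evaluation metric, repeat the whole experiment 5 times, and report the mean and standard deviation. The following is a minimal Python sketch of that protocol only; `sample_responses` and `metric` are hypothetical stand-in callables, not the paper's actual API (the released code is at the GitHub link above).

```python
import statistics
from typing import Callable, List, Tuple


def evaluate_over_runs(
    sample_responses: Callable[[str, int], List[str]],  # hypothetical: prompt -> n responses
    metric: Callable[[List[str]], float],               # hypothetical: responses -> score
    prompts: List[str],
    n_samples: int,
    n_runs: int = 5,
) -> Tuple[float, float]:
    """Repeat the evaluation n_runs times and report the mean and standard
    deviation of the metric, mirroring the "run each experiment 5 times"
    protocol quoted above."""
    scores: List[float] = []
    for _ in range(n_runs):
        # Gather n_samples responses per prompt, then score the pooled set.
        responses = [r for p in prompts for r in sample_responses(p, n_samples)]
        scores.append(metric(responses))
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Toy, deterministic stand-ins just to show the call shape:
fake_model = lambda prompt, n: [prompt] * n
mean, std = evaluate_over_runs(fake_model, len, ["q1", "q2"], n_samples=3)
# mean == 6, std == 0.0 for this deterministic stub
```

In the paper's actual setup, `sample_responses` would query one of the three LLMs at the stated temperatures (1 for GPT-3.5, 0.6 for Llama 2), and the prompts would be perturbed (paraphrased) inputs.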