Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

Authors: Kyle Cox, Jiawei Xu, Yikun Han, Rong Xu, Tianhao Li, Chi-Yang Hsu, Tianlong Chen, Walter Gerych, Ying Ding

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
LLM Response: "Let Q(Y | x, θ) be the response distribution of the language model at an input x and parameterization θ. ... We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023)."

Researcher Affiliation: Academia
LLM Response: "Kyle Cox (1), Jiawei Xu (1), Yikun Han (2), Rong Xu (1), Tianhao Li (1), Chi-Yang Hsu (1), Tianlong Chen (3), Walter Gerych (4), Ying Ding (1). (1) University of Texas at Austin, (2) University of Michigan-Ann Arbor, (3) University of North Carolina at Chapel Hill, (4) MIT. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL"

Pseudocode: No
LLM Response: The paper describes methods and formulas using mathematical notation and descriptive text, but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks or sections.

Open Source Code: Yes
LLM Response: "Code: https://github.com/xocelyk/paraphrase-uncertainty"

Open Datasets: Yes
LLM Response: "Datasets. We use two question-answering datasets: Trivia QA (Joshi et al. 2017) and Natural Questions (NQ) (Kwiatkowski et al. 2019)."

Dataset Splits: Yes
LLM Response: "Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset."

Hardware Specification: Yes
LLM Response: "The experiments were conducted on eight RTX A6000 GPUs."

Software Dependencies: No
LLM Response: The paper names the specific LLMs used (GPT-3.5 (gpt-3.5-turbo-0125), Llama 2-Base (7B), and Llama 2-Chat (7B)) but does not give version numbers for ancillary software dependencies such as programming languages or libraries.

Experiment Setup: Yes
LLM Response: "Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset. The number of perturbations (np) and the number of samples (ns) were experimented with in different pairs. We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023). For GPT-3.5, we sample with temperature 1. For Llama models, we sample with temperature 0.6."
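The experiment-setup row describes a repeated-runs protocol: sample responses for each prompt, compute an evaluation metric, repeat the whole experiment 5 times, and report the mean and standard deviation. The following is a minimal Python sketch of that protocol only; `sample_responses` and `metric` are hypothetical stand-in callables, not the paper's actual API (the released code is at the GitHub link above).

```python
import statistics
from typing import Callable, List, Tuple


def evaluate_over_runs(
    sample_responses: Callable[[str, int], List[str]],  # hypothetical: prompt -> n responses
    metric: Callable[[List[str]], float],               # hypothetical: responses -> score
    prompts: List[str],
    n_samples: int,
    n_runs: int = 5,
) -> Tuple[float, float]:
    """Repeat the evaluation n_runs times and report the mean and standard
    deviation of the metric, mirroring the "run each experiment 5 times"
    protocol quoted above."""
    scores: List[float] = []
    for _ in range(n_runs):
        # Gather n_samples responses per prompt, then score the pooled set.
        responses = [r for p in prompts for r in sample_responses(p, n_samples)]
        scores.append(metric(responses))
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Toy, deterministic stand-ins just to show the call shape:
fake_model = lambda prompt, n: [prompt] * n
mean, std = evaluate_over_runs(fake_model, len, ["q1", "q2"], n_samples=3)
# mean == 6, std == 0.0 for this deterministic stub
```

In the paper's actual setup, `sample_responses` would query one of the three LLMs at the stated temperatures (1 for GPT-3.5, 0.6 for Llama 2), and the prompts would be perturbed (paraphrased) inputs.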