Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
Authors: Kyle Cox, Jiawei Xu, Yikun Han, Rong Xu, Tianhao Li, Chi-Yang Hsu, Tianlong Chen, Walter Gerych, Ying Ding
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Perturbation Method. Let Q(Y \| x, θ) be the response distribution of the language model at an input x and parameterization θ. ... We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023). |
| Researcher Affiliation | Academia | Kyle Cox¹, Jiawei Xu¹, Yikun Han², Rong Xu¹, Tianhao Li¹, Chi-Yang Hsu¹, Tianlong Chen³, Walter Gerych⁴, Ying Ding¹. ¹University of Texas at Austin; ²University of Michigan-Ann Arbor; ³University of North Carolina at Chapel Hill; ⁴MIT |
| Pseudocode | No | The paper describes methods and formulas using mathematical notation and descriptive text, but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks or sections. |
| Open Source Code | Yes | Code https://github.com/xocelyk/paraphrase-uncertainty |
| Open Datasets | Yes | Datasets. We use two question-answering datasets: Trivia QA (Joshi et al. 2017) and Natural Questions (NQ) (Kwiatkowski et al. 2019). |
| Dataset Splits | Yes | Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset. |
| Hardware Specification | Yes | The experiments were conducted on eight RTX A6000 GPUs. |
| Software Dependencies | No | The paper names the specific LLMs used (GPT-3.5 (gpt-3.5-turbo-0125), Llama 2-Base (7B), and Llama 2-Chat (7B)) but does not provide version numbers for ancillary software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | Implementation details. For our experiments, we select the first 1,000 question-answer pairs from the validation split of each dataset. The number of perturbations (n_p) and the number of samples (n_s) were experimented with in different pairs. We run each experiment 5 times, and report the means and standard deviations of evaluation metrics. We utilize three LLMs: GPT-3.5 (gpt-3.5-turbo-0125) (Ouyang et al. 2022), Llama 2-Base (7B), and Llama 2-Chat (7B) (Touvron et al. 2023). For GPT-3.5, we sample with temperature 1. For Llama models, we sample with temperature 0.6. |
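The repeated-runs protocol quoted in the setup row above (run each experiment 5 times, report the mean and standard deviation of each evaluation metric) can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate_run` and its placeholder metric are hypothetical stand-ins for the paper's actual perturbation-and-sampling pipeline.

```python
import random
import statistics

def evaluate_run(seed: int, n_p: int = 5, n_s: int = 10) -> float:
    """Hypothetical stand-in for one experiment: in the paper this would
    draw n_p perturbations (paraphrases) of each question, sample n_s
    responses per perturbation, and score an evaluation metric.
    Here we just return a seeded noisy placeholder value."""
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)  # placeholder metric

def report(n_runs: int = 5) -> tuple[float, float]:
    """Run the experiment n_runs times (5 in the paper) and
    aggregate the metric as mean and standard deviation."""
    scores = [evaluate_run(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

mean, std = report()
print(f"metric = {mean:.3f} ± {std:.3f}")
```

Reporting mean ± standard deviation over independently seeded runs, as the paper does, separates the metric's central tendency from sampling noise introduced by temperature-based decoding.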