Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Teaching Models to Express Their Uncertainty in Words
Authors: Stephanie Lin, Jacob Hilton, Owain Evans
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that a GPT-3 model can learn to express uncertainty about answers in natural language, without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence"). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and is sensitive to uncertainty in its own answers, rather than imitating human examples. For testing calibration, we introduce the Calibrated Math suite of tasks. We compare the calibration of uncertainty expressed in words ("verbalized probability") to uncertainty extracted from model logits. |
| Researcher Affiliation | Collaboration | Stephanie Lin (University of Oxford), Jacob Hilton (OpenAI), Owain Evans (University of Oxford) |
| Pseudocode | No | The paper describes methods and procedures in narrative text, without providing any formally structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide links to a code repository. |
| Open Datasets | No | We introduce a new test suite for calibration. Calibrated Math is a suite of elementary mathematics problems. For each question, a model must produce both a numerical answer and a confidence in its answer (see Figure 1). There are many types of question, which vary substantially in content and in difficulty for GPT-3. This allows us to test how calibration generalizes under distribution shifts (by shifting the question type) and makes for a challenging test (see Figure 3). Since GPT-3's math abilities differ greatly from humans', GPT-3 cannot simply imitate human expressions of uncertainty. Calibrated Math is a test suite consisting of 21 arithmetic tasks... For each task, questions and answers are programmatically generated. |
| Dataset Splits | Yes | Our main experiments use the Add-subtract training set (Figure 3). This consists of tasks in Calibrated Math that involve addition or subtraction and have a unique correct answer. The evaluation set (called "Multi-answer") consists of questions with multiple correct answers... Models trained on Add-subtract are also evaluated on a second evaluation set called Multiply-divide. For each sub-task T we randomly sample 100 questions and generate GPT-3's zero-shot answers (using greedy decoding) for a total of |T| × 100 ≈ 10k inputs. |
| Hardware Specification | No | The paper states: "For our experiments, we used the 175-billion parameter GPT-3 model (davinci) via the OpenAI API (Brown et al., 2020)." This identifies the model used but does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) for running the experiments. |
| Software Dependencies | No | The paper mentions using the "OpenAI API" and "OpenAI's fine-tuning API" but does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Models are trained for one epoch to prevent overfitting, using the default hyperparameters from OpenAI's fine-tuning API with learning_rate_multiplier = 0.1 (OpenAI, 2021). We additionally carry out a form of early stopping that takes into account the difference between the sub-task-level targets p̂_T and a model's binary accuracy of 0/1 on any individual question. |
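The rows above describe scoring how well a model's stated confidences match its empirical accuracy. As a minimal sketch of what such a calibration check can look like, the function below computes expected calibration error (ECE) over (confidence, correctness) pairs. This is a standard metric, not the paper's exact evaluation code; the function name, bin count, and the example phrase-to-probability mapping are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then compare each bin's mean
    confidence to its empirical accuracy, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, outcome))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical mapping from verbal confidence phrases to probabilities,
# assumed for illustration only.
PHRASE_TO_PROB = {"low confidence": 0.3,
                  "medium confidence": 0.6,
                  "high confidence": 0.9}

phrases = ["high confidence", "high confidence",
           "medium confidence", "low confidence"]
confs = [PHRASE_TO_PROB[p] for p in phrases]
correct = [1, 1, 1, 0]  # 1 = model's answer was right
print(round(expected_calibration_error(confs, correct), 3))  # prints 0.225
```

A perfectly calibrated model would score 0.0: every bin's average stated confidence would equal its observed accuracy.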