Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs
Authors: Yinong Oliver Wang, Nivedha Sivakumar, Falaah Arif Khan, Katherine Metcalf, Adam Golinski, Natalie Mackraz, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. |
| Researcher Affiliation | Collaboration | 1Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, US. 2Apple Inc., Cupertino, CA, US. 3Center for Data Science, New York University, New York, NY, US. |
| Pseudocode | No | The paper describes methods and metrics using textual descriptions and mathematical equations, such as Eq. (1), (2), (3), (4), and (5), but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper states 'Dataset available at https://github.com/apple/ml-synthbias.', which provides access to the SynthBias dataset. However, it does not provide an explicit statement or a link to open-source code for the UCerF metric or the general methodology described in the paper. |
| Open Datasets | Yes | To address these limitations, we curated a new dataset, SynthBias, a large-scale (31,756 samples), semantically challenging dataset annotated by humans... Dataset available at https://github.com/apple/ml-synthbias. |
| Dataset Splits | No | The paper describes the composition of the SynthBias dataset as 14,132 type-1 sentences and 17,624 type-2 sentences, totaling 31,756 verified samples. However, it does not explicitly state how this dataset is further split into training, validation, or test sets for the purpose of reproducing the evaluations conducted on the LLMs. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments or evaluating the LLMs. |
| Software Dependencies | No | The paper mentions using specific LLMs for dataset generation (GPT-4o-2024-08-06) and for evaluation (e.g., Pythia, Mistral, Falcon, Llama models). It also refers to 'OpenAI's text embedding model'. However, it does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other tools used to implement the UCerF metric or conduct the evaluations. |
| Experiment Setup | Yes | The paper explicitly describes the experiment setup for both the 'Intrinsic Task' and 'MCQ Task', including the exact prompts used to query the LLMs and how predictions and uncertainty estimations are collected. For example, for the intrinsic task: 'Given a gender-occupation co-reference resolution sentence, we ask the model to predict the next word following the prompt <sentence> The pronoun <pronoun> refers to the ___. We collect the model's prediction and uncertainty estimation based on the next-word probability predicted by the model.' |
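The intrinsic-task protocol quoted above (format a co-reference prompt, then read off the model's next-word distribution) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `next_token_distribution` is a hypothetical stand-in for a real LLM's next-word probability output, and the toy probabilities are invented for demonstration.

```python
# Sketch of the intrinsic-task prompt protocol: build the cloze prompt,
# query next-word probabilities, and treat the top probability as the
# model's confidence (i.e., the basis of its uncertainty estimate).

def build_intrinsic_prompt(sentence: str, pronoun: str) -> str:
    """Format '<sentence> The pronoun <pronoun> refers to the' as in the paper's prompt."""
    return f"{sentence} The pronoun {pronoun} refers to the"

def next_token_distribution(prompt: str) -> dict[str, float]:
    # Hypothetical stand-in for an LLM's next-word probabilities.
    return {"doctor": 0.6, "nurse": 0.3, "patient": 0.1}

def predict_with_uncertainty(sentence: str, pronoun: str) -> tuple[str, float]:
    """Return the top next-word prediction and its probability."""
    probs = next_token_distribution(build_intrinsic_prompt(sentence, pronoun))
    word = max(probs, key=probs.get)
    return word, probs[word]

word, p = predict_with_uncertainty(
    "The doctor met the nurse because she needed advice.", "she"
)
print(word, p)  # → doctor 0.6
```

With a real model, `next_token_distribution` would be replaced by a softmax over the model's next-token logits for the formatted prompt; the structure of the protocol is unchanged.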