Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs
Authors: Yinong Oliver Wang, Nivedha Sivakumar, Falaah Arif Khan, Katherine Metcalf, Adam Golinski, Natalie Mackraz, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. |
| Researcher Affiliation | Collaboration | 1Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, US. 2Apple Inc., Cupertino, CA, US. 3Center for Data Science, New York University, New York, NY, US. |
| Pseudocode | No | The paper describes methods and metrics using textual descriptions and mathematical equations, such as Eq. (1), (2), (3), (4), and (5), but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper states 'Dataset available at https://github.com/apple/ml-synthbias.', which provides access to the SynthBias dataset. However, it does not provide an explicit statement or a link to open-source code for the UCerF metric or the general methodology described in the paper. |
| Open Datasets | Yes | To address these limitations, we curated a new dataset, SynthBias, a large-scale (31,756 samples), semantically challenging dataset annotated by humans... Dataset available at https://github.com/apple/ml-synthbias. |
| Dataset Splits | No | The paper describes the composition of the SynthBias dataset as 14,132 type-1 sentences and 17,624 type-2 sentences, totaling 31,756 verified samples. However, it does not explicitly state how this dataset is further split into training, validation, or test sets for the purpose of reproducing the evaluations conducted on the LLMs. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments or evaluating the LLMs. |
| Software Dependencies | No | The paper mentions using specific LLMs for dataset generation (GPT-4o-2024-08-06) and for evaluation (e.g., Pythia, Mistral, Falcon, Llama models). It also refers to 'OpenAI's text embedding model'. However, it does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other tools used to implement the UCerF metric or conduct the evaluations. |
| Experiment Setup | Yes | The paper explicitly describes the experiment setup for both the 'Intrinsic Task' and 'MCQ Task', including the exact prompts used to query the LLMs and how predictions and uncertainty estimations are collected. For example, for the intrinsic task: 'Given a gender-occupation co-reference resolution sentence, we ask the model to predict the next word following the prompt <sentence> The pronoun <pronoun> refers to the ___. We collect the model's prediction and uncertainty estimation based on the next-word probability predicted by the model.' |
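The intrinsic-task protocol quoted above (format a co-reference prompt, then read off the model's next-word distribution) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `next_token_distribution` is a hypothetical stand-in for a real LLM's next-word probability output, and the toy probabilities are invented for demonstration.

```python
# Sketch of the intrinsic-task prompt protocol: build the cloze prompt,
# query next-word probabilities, and treat the top probability as the
# model's confidence (i.e., the basis of its uncertainty estimate).

def build_intrinsic_prompt(sentence: str, pronoun: str) -> str:
    """Format '<sentence> The pronoun <pronoun> refers to the' as in the paper's prompt."""
    return f"{sentence} The pronoun {pronoun} refers to the"

def next_token_distribution(prompt: str) -> dict[str, float]:
    # Hypothetical stand-in for an LLM's next-word probabilities.
    return {"doctor": 0.6, "nurse": 0.3, "patient": 0.1}

def predict_with_uncertainty(sentence: str, pronoun: str) -> tuple[str, float]:
    """Return the top next-word prediction and its probability."""
    probs = next_token_distribution(build_intrinsic_prompt(sentence, pronoun))
    word = max(probs, key=probs.get)
    return word, probs[word]

word, p = predict_with_uncertainty(
    "The doctor met the nurse because she needed advice.", "she"
)
print(word, p)  # → doctor 0.6
```

With a real model, `next_token_distribution` would be replaced by a softmax over the model's next-token logits for the formatted prompt; the structure of the protocol is unchanged.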