Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Guiding LLM Decision-Making with Fairness Reward Models

Authors: Zara Hall, Melanie Subbiah, Thomas Zollo, Kathleen McKeown, Richard S. Zemel

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. When applied to real-world decision-making tasks including recidivism prediction and social media moderation, our approach consistently improves fairness while matching, or even surpassing, baseline accuracy. Here we detail the experimental setup for applying our trained FRM to the previously described downstream tasks.
Researcher Affiliation Academia Zara Hall Columbia University EMAIL Melanie Subbiah Columbia University EMAIL Thomas P. Zollo Columbia University EMAIL Kathleen McKeown Columbia University EMAIL Richard Zemel Columbia University EMAIL
Pseudocode No The paper describes the framework and methods in narrative text and diagrams (e.g., Figure 2) but does not present any formal pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/zarahall/fairness-prms.
Open Datasets Yes In order to facilitate future research in this area, our dataset [2] and trained FRM [3] are publicly available. [2] https://huggingface.com/datasets/zarahall/fairness-prm-training-data [3] https://huggingface.co/zarahall/fairness-reward-model
Dataset Splits No The paper describes training data generation for the FRM using a subsample of 4395 questions from the BBQ dataset and generating 255,000 reasoning steps. It mentions validation on "held-out BBQ data." For the downstream tasks (COMPAS, Civil Comments, Bias in Bios), the paper states using these datasets but does not explicitly provide specific train/test/validation splits, proportions, or sample counts used for their evaluation.
Hardware Specification Yes We train for 2 epochs on 255,000 reasoning steps (for PRMs) or 79,000 reasoning chains (for ORMs) using 4 NVIDIA A100 GPUs with 40GB of memory each.
Software Dependencies No All experiments were conducted using vllm and Hugging Face's transformers libraries. The paper does not specify version numbers for these libraries.
Experiment Setup Yes We initialize our reward model training from a LLaMA-3.2-1B-Instruct base model... we train with binary cross-entropy loss and use the AdamW optimizer with a learning rate of 2e-5, a batch size of 128, and β parameters (0.9, 0.95). We train for 2 epochs on 255,000 reasoning steps (for PRMs) or 79,000 reasoning chains (for ORMs).
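
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object to estimate the length of a training run. The sketch below is illustrative only and is not taken from the authors' code: the names `FRMTrainingConfig`, `optimizer_steps`, and `binary_cross_entropy` are hypothetical, and it assumes that 128 is the global batch size, that the final partial batch is kept, that no gradient accumulation is used, and that label 1 marks the positive class.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FRMTrainingConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    base_model: str = "LLaMA-3.2-1B-Instruct"
    learning_rate: float = 2e-5
    batch_size: int = 128          # assumed to be the global batch size
    adam_betas: tuple = (0.9, 0.95)
    epochs: int = 2


def optimizer_steps(num_examples: int, cfg: FRMTrainingConfig) -> int:
    """Total optimizer updates, assuming the last partial batch is kept
    and no gradient accumulation is used (both are assumptions)."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


def binary_cross_entropy(prob: float, label: int) -> float:
    """Per-example binary cross-entropy for a predicted probability.
    The label convention (1 = positive class) is an assumption."""
    eps = 1e-12                          # clamp to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


cfg = FRMTrainingConfig()
prm_steps = optimizer_steps(255_000, cfg)  # reasoning steps (PRM)
orm_steps = optimizer_steps(79_000, cfg)   # reasoning chains (ORM)
print(prm_steps, orm_steps)                # 3986 1236
```

Under these assumptions, PRM training takes roughly 3,986 optimizer updates and ORM training roughly 1,236. The actual counts would differ if the authors dropped the last partial batch or used gradient accumulation.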