Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Beyond Prediction: Managing the Repercussions of Machine Learning Applications

Authors: Aline Weber, Blossom Metevier, Yuriy Brun, Philip S. Thomas, Bruno Silva

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate, using real-life data, that THEIA can identify models that achieve high accuracy while ensuring, with high confidence, that constraints on their repercussions are satisfied. We empirically analyze THEIA's performance in two real-life settings, while varying both the amount of training data and the amount of repercussions that a classifier's predictions have.
Researcher Affiliation	Academia	Aline Weber, University of Massachusetts, Amherst, MA 01003, USA, EMAIL; Blossom Metevier, University of Massachusetts, Amherst, MA 01003, USA, EMAIL; Yuriy Brun, University of Massachusetts, Amherst, MA 01003, USA, EMAIL; Philip S. Thomas, University of Massachusetts, Amherst, MA 01003, USA, EMAIL; Bruno Castro da Silva, University of Massachusetts, Amherst, MA 01003, USA, EMAIL
Pseudocode	Yes	Algorithm 1: THEIA. Input: 1) D = {(X_i, Y_i, Ŷ_i^β, R_i^β)}_{i=1}^n; 2) confidence level δ; 3) tolerance value τ; 4) behavior model β; and 5) Bound ∈ {Hoeff, ttest}. Output: Model θ_c or NSF. Algorithm 2: cost. Input: 1) the vector θ that parameterizes model π; 2) D_c = {(X_i, Y_i, Ŷ_i^β, R_i^β)}_{i=1}^m; 3) confidence level δ; 4) tolerance value τ; 5) the behavior model β; 6) Bound ∈ {Hoeff, ttest}; and 7) the number of data points in D_f, denoted n_{D_f}. Output: The cost of π. Algorithm 3: THEIA with Multiple Constraints. Input: 1) D = {(X_i, Y_i, Ŷ_i^β, R_i^β)}_{i=1}^n; 2) the number of repercussion constraints, k; 3) a sequence of Boolean conditionals (c_j)_{j=1}^k such that for j ∈ {1, ..., k}, c_j(X_i, Y_i) indicates whether the event associated with the data point (X_i, Y_i, Ŷ_i^β, R_i^β) occurs; 4) confidence levels δ = (δ_j)_{j=1}^k, where each δ_j ∈ (0, 1) corresponds to repercussion constraint g_j; 5) tolerance values τ = (τ_j)_{j=1}^k, where each τ_j is the tolerance associated with repercussion constraint g_j; 6) the behavior model β; and 7) Bound ∈ {Hoeff, ttest}. Output: Model θ_c or NSF. Algorithm 4: cost with Multiple Constraints.
Open Source Code	No	The code is not yet publicly available. We are in the final stages of developing a library that will be made publicly available to the community, and we expect to release it in the coming months.
Open Datasets	Yes	In our first experiment (EXP-1), a classifier makes predictions about whether youth in the U.S. foster care system are likely to get a job. [...] EXP-1 uses two data sources from the National Data Archive on Child Abuse and Neglect [1], which include financial, educational, and well-being data on youth over time and during their transition from foster care to adulthood. [...] In our second experiment (EXP-2), a bank's lending decisions are informed by a classifier predicting repayment success. [...] EXP-2 uses real-life financial information for 250,000 clients who requested loans [44]. [1] Administration on Children, Youth & Families. National Data Archive on Child Abuse and Neglect (NDACAN). 2021. URL https://www.ndacan.acf.hhs.gov/. [44] Will Cukierski Credit Fusion. Give Me Some Credit Dataset, 2011. URL https://www.kaggle.com/competitions/GiveMeSomeCredit.
Dataset Splits	Yes	We partitioned the dataset D into Dc and Df using a stratified sampling approach where Dc contains 60% of the data and Df contains 40% of the data.
Hardware Specification	Yes	Experiments were conducted on a computer cluster containing 50 computer nodes with 28 cores (2 processors, 14 cores each; 56 cores with hyper-threading) Xeon E5-2680 v4 @ 2.40GHz, 128GB RAM, 200GB local SSD disk, and 50 compute nodes with 28 cores (2 processors, 18 cores each; 72 cores with hyper-threading) Xeon Gold 6240 CPU @ 2.60GHz, 192GB RAM, and 240GB local SSD disk. Each node had 3GB of allocated memory.
Software Dependencies	No	In all experiments, our implementation of THEIA used ES [36, 38] to search over the space of candidate models and the ttest concentration inequality.
Experiment Setup	Yes	In all experiments, our implementation of THEIA used ES [36, 38] to search over the space of candidate models and the ttest concentration inequality. [...] The confidence level δ_t for all objectives is 0.1. [...] To do so, we define two repercussion objectives, g_0 and g_1. Let t ∈ {0, 1} and g_t(θ) := E[R^{π_θ} | X^r = t] − τ_t, where τ_t = (1/n_t) Σ_{d=1}^n R_d^β ⟦X_d^r = t⟧ is the average observed repercussion caused by β on people of race (or age group) X^r = t and where n_t = Σ_{d=1}^n ⟦X_d^r = t⟧. [...] We vary α from 0 to 1 in increments of 0.1.
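
Illustration (not from the paper): the Dataset Splits and Experiment Setup rows describe a stratified 60%/40% partition of D into Dc and Df, and a tolerance equal to the average observed repercussion within a group. The authors' code is not yet public, so the following is only a minimal sketch of how those two steps might look. The function names `stratified_split` and `tolerance`, the choice to stratify on a single label per data point, and the fixed seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(strata, frac_c=0.6, seed=0):
    """Split indices into (idx_c, idx_f), keeping roughly frac_c of each stratum in Dc.

    `strata` holds one hashable, sortable label per data point. Which variable
    the paper stratifies on is not stated in the excerpt; this is an assumption.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for i, s in enumerate(strata):
        by_stratum[s].append(i)

    idx_c, idx_f = [], []
    for s in sorted(by_stratum):
        idx = by_stratum[s]
        rng.shuffle(idx)
        cut = round(frac_c * len(idx))  # number of points from this stratum sent to Dc
        idx_c.extend(idx[:cut])
        idx_f.extend(idx[cut:])
    return sorted(idx_c), sorted(idx_f)


def tolerance(repercussions, groups, t):
    """Average observed repercussion over the data points whose group equals t."""
    selected = [r for r, g in zip(repercussions, groups) if g == t]
    if not selected:
        raise ValueError(f"no data points in group {t!r}")
    return sum(selected) / len(selected)


if __name__ == "__main__":
    # Toy data: 10 points in stratum 0 and 20 points in stratum 1.
    strata = [0] * 10 + [1] * 20
    idx_c, idx_f = stratified_split(strata)
    print(len(idx_c), len(idx_f))

    # Toy repercussions observed under the behavior model, for two groups.
    print(tolerance([1.0, 0.0, 1.0, 1.0], [0, 0, 1, 1], t=0))
```

Splitting within each stratum, rather than shuffling the whole dataset once, keeps the stratum proportions in Dc and Df close to those in D, which is the usual motivation for stratified sampling.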