Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On Group Sufficiency Under Label Bias

Authors: Haoran Zhang, Olawale Salaudeen, Marzyeh Ghassemi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We theoretically show that enforcing fairness with respect to label biased data necessarily results in group miscalibration with respect to the true labels. We then propose a regularizer which minimizes an upper bound on the sufficiency gap by penalizing a conditional mutual information term. Across experiments on eight tabular, image, and text datasets with both synthetic and real label noise, we find that our method reduces the sufficiency gap by up to 7.2% with no significant decrease in overall accuracy.
Researcher Affiliation	Academia	Haoran Zhang Olawale Salaudeen Marzyeh Ghassemi Massachusetts Institute of Technology EMAIL
Pseudocode	Yes	Algorithm 1: Single Step Update of CMI-REG; Algorithm 2: Sufficiency Regularised Learning under Class Attribute Label Noise
Open Source Code	Yes	Code: https://github.com/MLforHealth/sufficiency_label_bias
Open Datasets	Yes	Table 1: Datasets used this paper. All attributes contain two groups. n: number of samples, K: number of classes. Data processing details can be found in Appendix B.1. Rows: adult [80] | Tabular | Income 50k | Gender | 32,561 | 2 | Synthetic; lsac [81] | Tabular | Student passes the bar | Race | 18,337 | 2 | Synthetic; crime [82, 83] | Tabular | Binned rate of violent crime | Primary ethnicity | 1,994 | 5 | Synthetic; income [84] | Tabular | Binned income | Race | 1,445,699 | 3 | Synthetic; grades [85] | Tabular | Student passes exam | Gender | 856 | 2 | Real; civilcomments [86] | Text | Comment is toxic | Contains identity | 448,000 | 2 | Real; clothing1m [87] | Image | Type of clothing | Contains face | 1,072,409 | 14 | Real; cifar10ns [88, 89] | Image | Image classification | Image is grayscale | 60,000 | 10 | Real
Dataset Splits	Yes	All datasets are divided into 60%/20%/20% training/validation/test splits.
Hardware Specification	No	The paper describes the types of models used (MLP, ResNet-18, BERT-base) and general training procedures but does not specify any particular hardware components like CPU or GPU models used for experiments.
Software Dependencies	No	The paper mentions the Adam optimizer, ResNet-18, and BERT-base models but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies.
Experiment Setup	Yes	Hyperparameters were selected using a random hyperparameter search [78] with 20 runs. For the full hyperparameter grid for each method, see Appendix B.3. Each hyperparameter setting was repeated three times with different random seeds, which affects the dataset split, model initialization, and random noise (in the case of synthetic noise).
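
The protocol quoted in the Dataset Splits and Experiment Setup rows (60%/20%/20% splits, a 20-run random hyperparameter search, three seeds per setting) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the search space, and the seed values 0, 1, 2 are all assumptions, and the paper's actual hyperparameter grid is given in its Appendix B.3.

```python
import random


def split_indices(n, seed, fractions=(0.6, 0.2, 0.2)):
    """Shuffle range(n) with a seeded RNG and cut it into train/val/test."""
    rng = random.Random(seed)          # the seed determines the split
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


def sample_configs(search_space, n_runs, seed=0):
    """Random search: draw n_runs settings uniformly from the search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(n_runs)
    ]


# Hypothetical search space, for illustration only.
SEARCH_SPACE = {"lr": [1e-4, 1e-3, 1e-2], "reg_weight": [0.0, 0.1, 1.0]}
SEEDS = (0, 1, 2)  # assumed seed values; the source only says "three times"

configs = sample_configs(SEARCH_SPACE, n_runs=20)
# Each sampled setting is paired with every seed, giving 20 x 3 runs.
runs = [(config, seed) for config in configs for seed in SEEDS]

# Example: split the 856 samples of the grades dataset for one seed.
train, val, test = split_indices(856, seed=SEEDS[0])
print(len(train), len(val), len(test))
```

Because the split is derived from the seed, each of the three repeats sees a different train/validation/test partition, which matches the statement that the seed affects the dataset split as well as initialization and synthetic noise.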