Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On Group Sufficiency Under Label Bias
Authors: Haoran Zhang, Olawale Salaudeen, Marzyeh Ghassemi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically show that enforcing fairness with respect to label biased data necessarily results in group miscalibration with respect to the true labels. We then propose a regularizer which minimizes an upper bound on the sufficiency gap by penalizing a conditional mutual information term. Across experiments on eight tabular, image, and text datasets with both synthetic and real label noise, we find that our method reduces the sufficiency gap by up to 7.2% with no significant decrease in overall accuracy. |
| Researcher Affiliation | Academia | Haoran Zhang, Olawale Salaudeen, Marzyeh Ghassemi; Massachusetts Institute of Technology; EMAIL |
| Pseudocode | Yes | Algorithm 1: Single Step Update of CMI-REG; Algorithm 2: Sufficiency Regularised Learning under Class Attribute Label Noise |
| Open Source Code | Yes | Code: https://github.com/MLforHealth/sufficiency_label_bias |
| Open Datasets | Yes | Table 1: Datasets used in this paper. All attributes contain two groups. n: number of samples, K: number of classes. Data processing details can be found in Appendix B.1. Rows (dataset: modality; target; attribute; n; K; noise): adult [80]: Tabular; Income 50k; Gender; 32,561; 2; Synthetic. lsac [81]: Tabular; Student passes the bar; Race; 18,337; 2; Synthetic. crime [82, 83]: Tabular; Binned rate of violent crime; Primary ethnicity; 1,994; 5; Synthetic. income [84]: Tabular; Binned income; Race; 1,445,699; 3; Synthetic. grades [85]: Tabular; Student passes exam; Gender; 856; 2; Real. civilcomments [86]: Text; Comment is toxic; Contains identity; 448,000; 2; Real. clothing1m [87]: Image; Type of clothing; Contains face; 1,072,409; 14; Real. cifar10ns [88, 89]: Image; Image classification; Image is grayscale; 60,000; 10; Real. |
| Dataset Splits | Yes | All datasets are divided into 60%/20%/20% training/validation/test splits. |
| Hardware Specification | No | The paper describes the types of models used (MLP, ResNet-18, BERT-base) and general training procedures but does not specify any particular hardware components like CPU or GPU models used for experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer, ResNet-18, and BERT-base models but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies. |
| Experiment Setup | Yes | Hyperparameters were selected using a random hyperparameter search [78] with 20 runs. For the full hyperparameter grid for each method, see Appendix B.3. Each hyperparameter setting was repeated three times with different random seeds, which affect the dataset split, model initialization, and random noise (in the case of synthetic noise). |
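
The split and search protocol quoted in the table (60%/20%/20% splits, a 20-run random search, three seeds per setting) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the `SEARCH_SPACE` values, the `evaluate` callback, and the choice to rank settings by mean validation score are all assumptions made for the example. The actual grid is in the paper's Appendix B.3.

```python
import random

# Hypothetical search space for illustration only; the paper's actual
# grid is given in its Appendix B.3 and is not reproduced here.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "reg_weight": [0.0, 0.1, 1.0, 10.0],
}


def split_indices(n, seed, percents=(60, 20, 20)):
    """Shuffle range(n) with `seed` and cut it into train/val/test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test


def sample_configs(n_runs=20, search_seed=0):
    """Draw `n_runs` random settings from SEARCH_SPACE (random search)."""
    rng = random.Random(search_seed)
    return [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(n_runs)
    ]


def run_search(n_samples, evaluate, n_runs=20, seeds=(0, 1, 2)):
    """Score each setting over several seeds and return (mean_score, config).

    `evaluate(config, train_idx, val_idx, seed)` is a placeholder that the
    caller supplies; it should train a model and return a validation score.
    Ranking by mean validation score is an assumption of this sketch.
    """
    results = []
    for config in sample_configs(n_runs):
        scores = []
        for seed in seeds:
            # The seed changes the split; a real run would also use it for
            # model initialization and any synthetic label noise.
            train, val, _test = split_indices(n_samples, seed)
            scores.append(evaluate(config, train, val, seed))
        results.append((sum(scores) / len(scores), config))
    return max(results, key=lambda item: item[0])
```

Calling `run_search(n_samples, evaluate)` with a user-supplied `evaluate` function returns the best mean score and the corresponding setting. Integer percentages are used for the cut points to avoid floating-point rounding at the split boundaries, and the test set takes the remainder so that every sample lands in exactly one split.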