Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DPA: A one-stop metric to measure bias amplification in classification datasets

Authors: Bhanu Tokas, Rahul Nair, Hannah Kerner

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and imSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification. We conducted experiments on one tabular dataset (COMPAS [13]) and two image datasets (COCO [14] and imSitu [1]). For COCO and imSitu, we used gender as the protected attribute. In this section, we describe the experimental details and results for these datasets. For the COMPAS dataset, we used race as the protected attribute. The experiment setup and results for COMPAS are in Section E.
Researcher Affiliation	Academia	Bhanu Tokas Arizona State University EMAIL Rahul Nair Arizona State University EMAIL Hannah Kerner Arizona State University EMAIL
Pseudocode	No	The paper includes mathematical formulations for DPA (equations 5 and 6) and related concepts (equations 1-4, 10-18), but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification.
Open Datasets	Yes	Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and imSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification. [13] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. In Ethics of data and analytics, pages 254-264. Auerbach Publications, 2022. [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740-755. Springer, 2014. [1] Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5534-5542, 2016. In the CMNIST dataset, we used images of handwritten digits from the original MNIST [29] dataset.
Dataset Splits	Yes	We sampled balanced and unbalanced sub-datasets from COCO. The balanced dataset is subject to the constraint in Equation 7. This resulted in 6156 images in the sub-dataset (3078 male and 3078 female images). We used the same 12 objects for the unbalanced sub-dataset, but relaxed the constraint from Equation 7 as shown in Equation 8. This resulted in a sub-dataset of 15743 images (8885 male and 6588 female images). We sampled unbalanced and balanced sub-datasets from imSitu. To sample the balanced sub-dataset, we used the constraint in Equation 7. This resulted in a sub-dataset of 14600 images (7300 male and 7300 female images). For the unbalanced sub-dataset, we used a modified constraint (Equation 9). This resulted in a sub-dataset of 24301 images (14199 male and 10102 female images). We created balanced and unbalanced versions of the COMPAS dataset. For the unbalanced dataset, we sampled all available COMPAS instances (attributes, race labels, and recidivism labels) for each of the four A and T pairs. For the balanced dataset, we sampled an equal number of instances across the four A and T pairs.
Hardware Specification	Yes	We acknowledge the Research Computing at Arizona State University for providing HPC resources [28] that have contributed to the results reported in this paper. This computation was performed over an Intel Core i7 165H processor without any GPU acceleration.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) in the main text or supplementary sections.
Experiment Setup	Yes	We fine-tuned these models for 12 epochs on 8 dataset versions (4 versions of the balanced and 4 versions of the unbalanced sub-dataset). We trained the model for 15 epochs with a batch size of 32. We used the SGD optimizer, with a learning rate of 0.01 and momentum of 0.5. We used the binary cross-entropy loss for training. Table 14 (COCO Masking additional parameters): DPA and LA both use optimizer Adam, attacker depth 2, learning rate 0.001, 100 epochs, batch size 64. Table 15 (imSitu additional parameters): DPA and LA both use optimizer Adam, attacker depth 2, learning rate 0.001, 100 epochs, batch size 128. Table 16 (COMPAS additional parameters): DPA and LA both use optimizer Adam, attacker depth 2, learning rate 0.005, 50 epochs, batch size 512.
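
Illustrative sketch (not from the paper or its library): the Dataset Splits row says the balanced COMPAS version draws an equal number of instances from each of the four A and T pairs. One plausible way to do that is to group records by pair and sample down to the smallest group. The record keys "A" and "T", the function name balanced_sample, the seed, and the toy data are all assumptions made for this example; the authors' actual sampling procedure may differ.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(records, seed=0):
    """Return an equal number of records for every (A, T) pair.

    `records` is a list of dicts with hypothetical keys "A" (protected
    attribute) and "T" (task label). These names are illustrative only.
    """
    rng = random.Random(seed)

    # Group the records by their (A, T) pair.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["A"], rec["T"])].append(rec)

    # The smallest group caps how many instances each pair can contribute.
    per_pair = min(len(members) for members in groups.values())

    sample = []
    for pair in sorted(groups):
        # Sample without replacement inside each pair.
        sample.extend(rng.sample(groups[pair], per_pair))
    return sample


# Toy data with deliberately unequal group sizes (5, 3, 4, 2).
toy_sizes = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 1): 2}
toy = [
    {"A": a, "T": t, "id": f"{a}{t}-{i}"}
    for (a, t), size in toy_sizes.items()
    for i in range(size)
]

balanced = balanced_sample(toy)
pair_counts = Counter((rec["A"], rec["T"]) for rec in balanced)
print(dict(pair_counts))
```

In this toy run the smallest pair has 2 records, so every pair contributes 2 and the balanced sample has 8 records in total. The unbalanced version described in the same row would simply keep all records for every pair.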