Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Are My Deep Learning Systems Fair? An Empirical Study of Fixed-Seed Training

Authors: Shangshu Qian, Viet Hung Pham, Thibaud Lutellier, Zeou Hu, Jungwon Kim, Lin Tan, Yaoliang Yu, Jiahao Chen, Sameena Shah

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we conduct the first empirical study to quantify the impact of software implementation on the fairness of DL systems and its variance. Our study of 22 mitigation techniques and five baselines reveals up to 12.6% fairness variance across identical training runs with identical seeds.
Researcher Affiliation Collaboration Shangshu Qian (Purdue University, West Lafayette, IN, USA); Hung Viet Pham (University of Waterloo, Vector Institute); ...; Jiahao Chen (J. P. Morgan AI Research, New York, NY, USA); Sameena Shah (J. P. Morgan AI Research, New York, NY, USA)
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Data and code availability: Experiment data and the artifact for the reproducibility study are available in a public GitHub repository: https://github.com/lin-tan/fairness-variance/
Open Datasets Yes The experiments are performed on four popular datasets (CelebA, MS-COCO, imSitu, and CIFAR-10S) with three DL networks (ResNet-18, ResNet-50, and NIFR [47]), measured by seven popular bias metrics (Section 3).
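The seven bias metrics are not enumerated in this summary. As a generic illustration of what such a metric computes, the sketch below measures the accuracy gap between two demographic groups; this is a common fairness measure but not necessarily one of the paper's seven, and the function name and data are hypothetical.

```python
def group_accuracy_gap(preds, labels, groups):
    """Absolute accuracy gap between two demographic groups (0 and 1).

    A generic bias metric: a perfectly fair model (by this measure)
    would have a gap of 0.0.
    """
    acc = {}
    for g in (0, 1):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(preds[i] == labels[i] for i in idx)
        acc[g] = correct / len(idx)
    return abs(acc[0] - acc[1])


# Toy example: group 0 is classified 2/2 correctly, group 1 only 1/2.
preds = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
groups = [0, 0, 1, 1]
print(group_accuracy_gap(preds, labels, groups))  # -> 0.5
```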
Dataset Splits No For each technique, all the training runs are executed with the same training data (also the original training/test split), hyper-parameters, and optimizers. The paper mentions an “original training/test split” but does not explicitly provide percentages, sample counts, or clear details for a separate validation split in the main text.
Hardware Specification Yes Details of the hardware and software environment are in Appendix B.4.
Software Dependencies Yes Details of the hardware and software environment are in Appendix B.4.
Experiment Setup Yes For each technique, all the training runs are executed with the same training data (also the original training/test split), hyper-parameters, and optimizers. With the fixed seed, all training runs also have the same order of data and the same initial weights. We perform 16 fixed-seed identical training (FIT) runs with the same random seed for each technique, and then evaluate the fairness of the trained models using seven bias metrics.
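The FIT protocol above can be sketched as follows. This is an illustrative toy, not the paper's artifact: `train_once` and the bias metric are stand-ins, and a separate unseeded RNG mimics the implementation-level nondeterminism (e.g. nondeterministic GPU kernels) that fixing the training seed does not control.

```python
import random

# Unseeded RNG standing in for implementation-level nondeterminism that
# survives a fixed training seed (e.g. nondeterministic GPU kernels).
_impl_rng = random.Random()


def train_once(seed: int) -> float:
    """One toy 'training run' returning a bias metric.

    The seeded component is identical across runs (same data order, same
    initial weights); the _impl_rng component is not, mirroring the
    variance the study quantifies.
    """
    random.seed(seed)
    seeded_component = random.random()            # identical for every run
    nondet_component = _impl_rng.uniform(-0.005, 0.005)
    return 0.1 * seeded_component + nondet_component


def fit_variance(n_runs: int = 16, seed: int = 0) -> float:
    """Spread (max - min) of the bias metric over identical-seed runs."""
    scores = [train_once(seed) for _ in range(n_runs)]
    return max(scores) - min(scores)


print(f"fairness variance over 16 fixed-seed runs: {fit_variance():.4f}")
```

In a real PyTorch pipeline the seeding step would also cover `numpy` and `torch` (e.g. `torch.manual_seed`), and the residual spread would come from the software stack itself rather than a simulated RNG.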