MEDFAIR: Benchmarking Fairness for Medical Imaging

Authors: Yongshuo Zong, Yongxin Yang, Timothy Hospedales

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we find that the under-studied issue of model selection criterion can have a significant impact on fairness outcomes; while in contrast, state-of-the-art bias mitigation algorithms do not significantly improve fairness outcomes over empirical risk minimization (ERM) in both in-distribution and out-of-distribution settings. We conduct extensive experiments across eleven algorithms, ten datasets, four sensitive attributes, and three model selection strategies to assess bias mitigation algorithms in both in-distribution and out-of-distribution settings. We report multiple evaluation metrics and conduct rigorous statistical tests to find whether any of the algorithms is significantly better. Having trained over 7,000 models using 6,800 GPU-hours...
Researcher Affiliation | Collaboration | Yongshuo Zong (1), Yongxin Yang (1), Timothy Hospedales (1,2); (1) School of Informatics, University of Edinburgh; (2) Samsung AI Centre, Cambridge
Pseudocode | No | The paper describes methods in narrative text and does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ys-zong/MEDFAIR.
Open Datasets | Yes | Ten datasets are included in MEDFAIR: CheXpert (Irvin et al., 2019), MIMIC-CXR (Johnson et al., 2019), PAPILA (Kovalyk et al., 2022), HAM10000 (Tschandl et al., 2018), Fitzpatrick17k (Groh et al., 2021), OL3I (Chaves et al., 2021), COVID-CT-MD (Afshar et al., 2021), OCT (Farsiu et al., 2014), ADNI 1.5T, and ADNI 3T (Petersen et al., 2010), to evaluate the algorithms comprehensively, which are all publicly available to ensure reproducibility. See also Table A6: Access to the datasets [table with specific URLs].
Dataset Splits | Yes | Data splitting for experiments: unless otherwise specified, we randomly split the whole dataset into training/validation/testing sets with a proportion of 80/10/10 for 2D datasets and 70/10/20 for 3D datasets. (See the split sketch after this table.)
Hardware Specification | Yes | The experiments are conducted on Scientific Linux release 7.9 with one NVIDIA A100-SXM-80GB GPU. We trained over 7,000 models using 0.77 GPU year. (See the GPU-hour check after this table.)
Software Dependencies | Yes | The implementation is based on Python 3.9 and PyTorch 1.10. (See the environment check after this table.)
Experiment Setup | Yes | To achieve the optimal performance of each algorithm for fair comparisons, we perform a Bayesian hyper-parameter optimization search for each algorithm and each combination of dataset and sensitive attribute using the machine learning platform Weights & Biases (Biewald, 2020). We use batch sizes of 1024 and 8 for 2D and 3D images respectively. The SGD optimizer is used for all methods, and we apply early stopping if the validation worst-case AUC does not improve for 5 epochs. The following hyper-parameter space is searched (20 runs for each method per dataset/sensitive-attribute combination), where [ ] means a value range and { } means discrete values: ERM/Resampling/DomainInd: learning rate lr in [1e-3, 1e-5]. (See the sweep sketch after this table.)
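
A minimal sketch of the reported split procedure (80/10/10 for 2D datasets, 70/10/20 for 3D datasets), assuming a generic PyTorch dataset; the function name, seed handling, and use of `random_split` are illustrative assumptions, not the MEDFAIR implementation:

```python
import torch
from torch.utils.data import random_split

def split_dataset(dataset, is_3d=False, seed=0):
    """Randomly split a dataset into train/val/test with the proportions
    reported in the paper: 80/10/10 for 2D datasets, 70/10/20 for 3D datasets."""
    train_frac, val_frac = (0.7, 0.1) if is_3d else (0.8, 0.1)
    n = len(dataset)
    n_train, n_val = int(train_frac * n), int(val_frac * n)
    n_test = n - n_train - n_val  # remainder goes to the test split
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=generator)
```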
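The two compute figures quoted above are consistent with each other; a quick check, assuming a 365-day year:

```python
# 0.77 GPU years expressed in GPU hours
gpu_hours = 0.77 * 365 * 24
print(round(gpu_hours))  # 6745, in line with the ~6,800 GPU-hours quoted above
```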
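A small sanity-check sketch for matching the reported environment; the assertion style is an illustration, and any packages beyond Python 3.9 and PyTorch 1.10 should be taken from the repository's own requirements:

```python
import sys
import torch

# The paper reports Python 3.9 and PyTorch 1.10.
assert sys.version_info[:2] == (3, 9), f"expected Python 3.9, got {sys.version_info[:2]}"
assert torch.__version__.startswith("1.10"), f"expected PyTorch 1.10, got {torch.__version__}"
```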
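A hedged sketch of the described Bayesian hyper-parameter search with Weights & Biases (20 runs, learning rate searched between 1e-5 and 1e-3, SGD, validation worst-case AUC as the selection metric); the metric and parameter names, project name, and `train` entry point are placeholders, not the MEDFAIR code:

```python
import wandb

# Illustrative sweep configuration; names are assumptions, values follow the setup above.
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the search space
    "metric": {"name": "val_worst_case_auc", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "batch_size": {"value": 1024},  # 1024 for 2D images, 8 for 3D volumes
        "optimizer": {"value": "sgd"},
    },
}

def train():
    """Placeholder entry point: a real run would train with SGD, apply early
    stopping once the validation worst-case AUC has not improved for 5 epochs,
    and log that metric so the sweep can optimize it."""
    with wandb.init() as run:
        run.log({"val_worst_case_auc": 0.5})

if __name__ == "__main__":
    sweep_id = wandb.sweep(sweep_config, project="medfair-sweeps")  # placeholder project name
    wandb.agent(sweep_id, function=train, count=20)  # 20 runs per method/dataset/attribute
```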