A Closer Look at AUROC and AUPRC under Class Imbalance

Authors: Matthew McDermott, Haoran Zhang, Lasse Hansen, Giovanni Angelotti, Jack Gallifant

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes... Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets."

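To make the two metrics concrete, the following small example (purely illustrative, not from the paper) computes AUROC and AUPRC for noisy scores on a heavily imbalanced synthetic task; the prevalence and score model are assumptions chosen for demonstration.

```python
# Illustrative comparison of AUROC and AUPRC under class imbalance.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.01                   # 1% positives: heavy imbalance
y = (rng.random(n) < prevalence).astype(int)    # binary labels
s = rng.normal(loc=y.astype(float), scale=1.0)  # positives score higher on average

print("AUROC:", roc_auc_score(y, s))            # insensitive to prevalence
print("AUPRC:", average_precision_score(y, s))  # sensitive to prevalence
```
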
Researcher Affiliation: Academia
LLM Response: "Matthew B. McDermott, Harvard Medical School... Haoran Zhang, Massachusetts Institute of Technology... Lasse Hyldig Hansen, Aarhus University... Giovanni Angelotti, IRCCS Humanitas Research Hospital... Jack Gallifant, Massachusetts Institute of Technology..."

Pseudocode: No
LLM Response: "The paper describes algorithms and procedures in narrative text and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured code-like steps."

Open Source Code: Yes
LLM Response: "All code is available at https://github.com/hzhang0/auc_bias and https://github.com/Lassehhansen/ArxivMLClaimSearch."

Open Datasets: Yes
LLM Response: "We use the following four tabular binary classification datasets: adult [17], compas [14], lsac [413], and mimic [178]."

Dataset Splits: Yes
LLM Response: "We then split each dataset into 50% training, 25% validation, 25% test sets, stratified by the group."

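A minimal sketch of that 50/25/25 stratified split, assuming a pandas DataFrame with a "group" column (the column name and seed are assumptions, not the authors' code):

```python
# 50% train / 25% validation / 25% test, stratified by group membership.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 0):
    train, holdout = train_test_split(
        df, test_size=0.5, stratify=df["group"], random_state=seed
    )
    val, test = train_test_split(  # split the holdout in half: 25% / 25%
        holdout, test_size=0.5, stratify=holdout["group"], random_state=seed
    )
    return train, val, test
```
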
Hardware Specification: No
LLM Response: "The paper describes running synthetic and real-world experiments and references code availability in a Colab notebook, but it does not specify any particular hardware components like CPU or GPU models used for these experiments."

Software Dependencies: No
LLM Response: "The paper mentions using 'XGBoost models [65]' and 'random hyperparameter search [37]' but does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments."

Experiment Setup: Yes
LLM Response: "Experimental Setup. Let y ∈ {0, 1} be the binary label, s ∈ [0, 1] be the predicted score, and a ∈ {1, 2} be the subpopulation. We fix P_{y|a}(y = 1 | a = 1) = 0.05 and P_{y|a}(y = 1 | a = 2) = 0.01. We sample a dataset for each group... We run these experiments across 20 randomly sampled datasets and show the mean and an empirical 90% confidence interval around the mean... We train XGBoost models [65] on each dataset. For each task, we iterate over a grid of per-group weights in order to create a diverse set of models... we conduct a random hyperparameter search [37] with 50 runs. Hyperparameter grid:
- max depth: {1, 2, ..., 9}
- learning rate: [0.01, 0.3]
- number of estimators: [50, 1000]
- min child weight: {1, 2, ..., 9}
- use protected attribute as input feature: {yes, no}
- group weight of higher prevalence group: {1, 2, 3, 4, 5, 10, 15, 20, 25, 50}"

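The quoted setup translates directly into a short search loop. Below is a minimal sketch using XGBoost's scikit-learn API; the grid matches the quoted hyperparameter ranges, while the selection metric (validation AUROC), the group encoding, and all variable names are illustrative assumptions rather than the authors' implementation.

```python
# Random hyperparameter search (50 runs) over the grid quoted above,
# with per-group sample weights on the higher-prevalence group.
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
GROUP_WEIGHTS = [1, 2, 3, 4, 5, 10, 15, 20, 25, 50]

def sample_config():
    return {
        "max_depth": int(rng.integers(1, 10)),           # {1, ..., 9}
        "learning_rate": float(rng.uniform(0.01, 0.3)),  # [0.01, 0.3]
        "n_estimators": int(rng.integers(50, 1001)),     # [50, 1000]
        "min_child_weight": int(rng.integers(1, 10)),    # {1, ..., 9}
        "use_group_feature": bool(rng.integers(0, 2)),   # {yes, no}
        "group_weight": float(rng.choice(GROUP_WEIGHTS)),
    }

def run_search(X_tr, y_tr, g_tr, X_val, y_val, g_val, n_runs=50):
    """Keep the model with the best validation AUROC across n_runs configs."""
    best_auc, best_model = -np.inf, None
    for _ in range(n_runs):
        cfg = sample_config()
        # Up-weight the higher-prevalence group (group 1 in the quoted setup).
        w = np.where(g_tr == 1, cfg["group_weight"], 1.0)
        # Optionally include the protected attribute as an input feature.
        feats = np.column_stack([X_tr, g_tr]) if cfg["use_group_feature"] else X_tr
        model = xgb.XGBClassifier(
            max_depth=cfg["max_depth"],
            learning_rate=cfg["learning_rate"],
            n_estimators=cfg["n_estimators"],
            min_child_weight=cfg["min_child_weight"],
        )
        model.fit(feats, y_tr, sample_weight=w)
        f_val = np.column_stack([X_val, g_val]) if cfg["use_group_feature"] else X_val
        auc = roc_auc_score(y_val, model.predict_proba(f_val)[:, 1])
        if auc > best_auc:
            best_auc, best_model = auc, model
    return best_model, best_auc
```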