A Closer Look at AUROC and AUPRC under Class Imbalance
Authors: Matthew McDermott, Haoran Zhang, Lasse Hansen, Giovanni Angelotti, Jack Gallifant
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes... Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. |
| Researcher Affiliation | Academia | Matthew B. McDermott Harvard Medical School... Haoran Zhang Massachusetts Institute of Technology... Lasse Hyldig Hansen Aarhus University... Giovanni Angelotti IRCCS Humanitas Research Hospital... Jack Gallifant Massachusetts Institute of Technology... |
| Pseudocode | No | The paper describes algorithms and procedures in narrative text and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured code-like steps. |
| Open Source Code | Yes | All code is available at https://github.com/hzhang0/auc_bias and https://github.com/Lassehhansen/Arxiv_MLClaim_Search. |
| Open Datasets | Yes | We use the following four tabular binary classification datasets: adult [17], compas [14], lsac [413], and mimic [178]. |
| Dataset Splits | Yes | We then split each dataset into 50% training, 25% validation, 25% test sets, stratified by the group. |
| Hardware Specification | No | The paper describes running synthetic and real-world experiments and references code availability in a Colab notebook, but it does not specify any particular hardware components like CPU or GPU models used for these experiments. |
| Software Dependencies | No | The paper mentions using 'XGBoost models [65]' and 'random hyperparameter search [37]' but does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | Experimental Setup. Let y ∈ {0, 1} be the binary label, s ∈ [0, 1] be the predicted score, and a ∈ {1, 2} be the subpopulation. We fix P(y = 1 \| a = 1) = 0.05 and P(y = 1 \| a = 2) = 0.01. We sample a dataset for each group... We run these experiments across 20 randomly sampled datasets and show the mean and an empirical 90% confidence interval around the mean... We train XGBoost models [65] on each dataset. For each task, we iterate over a grid of per-group weights in order to create a diverse set of models... we conduct a random hyperparameter search [37] with 50 runs. Hyperparameter grid: max depth: {1, 2, ..., 9}; learning rate: [0.01, 0.3]; number of estimators: [50, 1000]; min child weight: {1, 2, ..., 9}; use protected attribute as input feature: {yes, no}; group weight of higher prevalence group: {1, 2, 3, 4, 5, 10, 15, 20, 25, 50}. *(Hedged sketches of this setup appear below the table.)* |
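To make the paper's central contrast concrete, here is a minimal sketch of per-group AUROC and AUPRC at the quoted prevalences of 0.05 and 0.01. It assumes numpy and scikit-learn; the Gaussian score model and the sample size are illustrative choices, not the paper's exact sampling procedure. With a scorer that is equally discriminative in both groups, AUROC comes out nearly identical across groups while AUPRC drops sharply in the lower-prevalence group.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Per-group prevalences from the quoted setup:
# P(y = 1 | a = 1) = 0.05, P(y = 1 | a = 2) = 0.01.
prevalences = {1: 0.05, 2: 0.01}
n = 100_000  # samples per group (illustrative, not from the paper)

for group, p in prevalences.items():
    y = rng.binomial(1, p, size=n)
    # Illustrative scores: positives are shifted up by the same margin
    # in both groups, so the scorer is equally discriminative per group.
    s = rng.normal(loc=y * 1.5, scale=1.0, size=n)
    print(f"group a={group}: prevalence={p:.2f}  "
          f"AUROC={roc_auc_score(y, s):.3f}  "
          f"AUPRC={average_precision_score(y, s):.3f}")
```

Running this, both groups land at roughly the same AUROC, while the a = 2 group's AUPRC is far lower purely because positives are rarer; that sensitivity of AUPRC to prevalence is what the paper characterizes theoretically and empirically.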
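The training pipeline quoted in the "Dataset Splits" and "Experiment Setup" rows can be sketched the same way. The helper below is hypothetical: a 50/25/25 train/validation/test split stratified by group, followed by a 50-run random search over the quoted XGBoost grid. Sampling the per-group weight randomly (the paper iterates over it on a grid), using validation AUROC as the selection criterion, and omitting the "use protected attribute as input feature" toggle are all simplifying assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def split_and_search(X, y, groups, n_trials=50, seed=0):
    """Hypothetical helper: 50/25/25 split stratified by group, then a
    random hyperparameter search over the grid quoted in the paper."""
    rng = np.random.default_rng(seed)

    # 50% train / 25% validation / 25% test, stratified by subgroup.
    X_tr, X_tmp, y_tr, y_tmp, g_tr, g_tmp = train_test_split(
        X, y, groups, train_size=0.5, stratify=groups, random_state=seed)
    X_va, X_te, y_va, y_te, g_va, g_te = train_test_split(
        X_tmp, y_tmp, g_tmp, train_size=0.5, stratify=g_tmp,
        random_state=seed)

    best_auroc, best_model = -np.inf, None
    for _ in range(n_trials):
        params = dict(
            max_depth=int(rng.integers(1, 10)),           # {1, ..., 9}
            learning_rate=float(rng.uniform(0.01, 0.3)),  # [0.01, 0.3]
            n_estimators=int(rng.integers(50, 1001)),     # [50, 1000]
            min_child_weight=int(rng.integers(1, 10)),    # {1, ..., 9}
        )
        # Weight for the higher-prevalence group (a = 1 in the quoted
        # setup); all other samples keep weight 1. The paper iterates
        # over this grid; sampling it randomly here is a simplification.
        w_hi = float(rng.choice([1, 2, 3, 4, 5, 10, 15, 20, 25, 50]))
        sample_weight = np.where(g_tr == 1, w_hi, 1.0)

        model = XGBClassifier(**params, random_state=seed)
        model.fit(X_tr, y_tr, sample_weight=sample_weight)

        # Validation AUROC as the selection criterion (an assumption;
        # the quoted text does not name the criterion).
        auroc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
        if auroc > best_auroc:
            best_auroc, best_model = auroc, model
    return best_model, (X_te, y_te, g_te)
```

Stratifying both splits by group keeps the per-group prevalence stable across train, validation, and test, which matters here because the paper's per-group metrics are exactly what the prevalence shift would otherwise distort.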