Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
A Closer Look at AUROC and AUPRC under Class Imbalance
Authors: Matthew McDermott, Haoran Zhang, Lasse Hansen, Giovanni Angelotti, Jack Gallifant
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes... Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. |
| Researcher Affiliation | Academia | Matthew B. McDermott Harvard Medical School... Haoran Zhang Massachusetts Institute of Technology... Lasse Hyldig Hansen Aarhus University... Giovanni Angelotti IRCCS Humanitas Research Hospital... Jack Gallifant Massachusetts Institute of Technology... |
| Pseudocode | No | The paper describes algorithms and procedures in narrative text and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured code-like steps. |
| Open Source Code | Yes | All code is available at https://github.com/hzhang0/auc_bias and https://github.com/Lassehhansen/ArxivMLClaimSearch. |
| Open Datasets | Yes | We use the following four tabular binary classification datasets: adult [17], compas [14], lsac [413], and mimic [178]. |
| Dataset Splits | Yes | We then split each dataset into 50% training, 25% validation, 25% test sets, stratified by the group. |
| Hardware Specification | No | The paper describes running synthetic and real-world experiments and references code availability in a Colab notebook, but it does not specify any particular hardware components like CPU or GPU models used for these experiments. |
| Software Dependencies | No | The paper mentions using 'XGBoost models [65]' and 'random hyperparameter search [37]' but does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | Experimental Setup. Let $y \in \{0, 1\}$ be the binary label, $s \in [0, 1]$ be the predicted score, and $a \in \{1, 2\}$ be the subpopulation. We fix $P_{y \mid a}(y = 1 \mid a = 1) = 0.05$ and $P_{y \mid a}(y = 1 \mid a = 2) = 0.01$. We sample a dataset for each group... We run these experiments across 20 randomly sampled datasets and show the mean and an empirical 90% confidence interval around the mean... We train XGBoost models [65] on each dataset. For each task, we iterate over a grid of per-group weights in order to create a diverse set of models... we conduct a random hyperparameter search [37] with 50 runs. Hyperparameter grid: max depth: {1, 2, ..., 9}; learning rate: [0.01, 0.3]; number of estimators: [50, 1000]; min child weight: {1, 2, ..., 9}; use protected attribute as input feature: {yes, no}; group weight of higher prevalence group: {1, 2, 3, 4, 5, 10, 15, 20, 25, 50}. |
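
The dataset split and random hyperparameter search quoted in the table can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function names, the dictionary keys `use_protected_attribute` and `group_weight_high_prevalence`, and the choice to sample the learning rate and number of estimators uniformly over their ranges are all assumptions; the quoted text gives only the ranges, not the sampling distributions.

```python
import random
from collections import defaultdict


def stratified_split(groups, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split row indices into train/validation/test, stratified by group.

    `groups` holds one group label per row. Each group is shuffled and
    divided separately, so every split keeps roughly the same group mix.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_group):
        idxs = by_group[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return sorted(train), sorted(val), sorted(test)


def sample_config(rng):
    """Draw one configuration from the hyperparameter grid in the table."""
    return {
        "max_depth": rng.randint(1, 9),
        # Assumption: uniform over the range; the source gives only [0.01, 0.3].
        "learning_rate": rng.uniform(0.01, 0.3),
        # Assumption: uniform over integers in [50, 1000].
        "n_estimators": rng.randint(50, 1000),
        "min_child_weight": rng.randint(1, 9),
        # Hypothetical key names for the two fairness-related settings.
        "use_protected_attribute": rng.choice([True, False]),
        "group_weight_high_prevalence": rng.choice(
            [1, 2, 3, 4, 5, 10, 15, 20, 25, 50]
        ),
    }


def sample_search(n_runs=50, seed=0):
    """Return `n_runs` independently sampled configurations."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n_runs)]
```

For example, `stratified_split([1] * 400 + [2] * 200)` returns 300 training, 150 validation, and 150 test indices, with group 1 making up two thirds of each split. Each dictionary from `sample_search()` could then be used to configure one model, with the validation split used to compare runs.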