Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Authors: Jeremias Traub, Till Bungert, Carsten Lüth, Michael Baumgartner, Klaus Maier-Hein, Lena Maier-Hein, Paul Jaeger

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets. |
| Researcher Affiliation | Academia | (1) German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Germany; (2) Helmholtz Imaging, DKFZ Heidelberg, Germany; (3) DKFZ Heidelberg, Division of Medical Image Computing (MIC), Germany; (4) DKFZ Heidelberg, Division of Intelligent Medical Systems (IMSY), Germany; (5) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, 69120 Heidelberg, Germany; (6) Faculty of Mathematics and Computer Science, University of Heidelberg, Germany; (7) National Center for Tumor Diseases (NCT) Heidelberg |
| Pseudocode | No | The paper includes mathematical derivations and formulas but does not feature explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code for reproducing our results and a PyTorch implementation of the AUGRC are available at: https://github.com/IML-DKFZ/fd-shifts. |
| Open Datasets | Yes | We evaluate SC methods on the FD-Shifts benchmark [Jäger et al., 2023], which considers a broad range of datasets and failure sources through various distribution shifts: SVHN [Netzer et al., 2011], CIFAR-10, and CIFAR-100 [Krizhevsky] are evaluated on semantic and non-semantic new-class shifts in a rotating fashion including Tiny ImageNet [Le and Yang, 2015]. |
| Dataset Splits | Yes | Based on the performance on the validation set, we choose the Deep Gambler reward hyperparameter and whether to use dropout (for the non-MCD-based CSFs). |
| Hardware Specification | Yes | Our method ranking study focuses on the evaluation of CSF performance based on the existing FD-Shifts benchmark, hence we required no GPUs for the analysis in Section 4. As both AURC and AUGRC can be computed efficiently, (CPU) evaluation time for a single test set is less than a minute; evaluation on 500 bootstrap samples on a single CPU core takes around 3 hours. *(See the AUGRC sketch below the table.)* |
| Software Dependencies | No | The paper mentions "a PyTorch implementation" of AUGRC but does not specify the version number for PyTorch or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The experiments are based on the same hyperparameters as reported in Table 4 in Jäger et al. [2023]. Based on the performance on the validation set, we choose the Deep Gambler reward hyperparameter and whether to use dropout (for the non-MCD-based CSFs). For the former, we select from [2.2, 3, 6, 10] on Wilds-Camelyon-17, CIFAR-10, and SVHN, from [2.2, 3, 6, 10, 15] on iWildCam and BREEDS-Entity-13, and from [2.2, 3, 6, 10, 12, 15, 20] on CIFAR-100. |
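
The Hardware Specification row cites the paper's claim that both AURC and AUGRC are cheap to compute on CPU. The sketch below is a minimal, illustrative NumPy implementation of the AUGRC (area under the generalized risk-coverage curve), assuming the metric is the integral over coverage of the generalized risk, i.e. the fraction of all samples that are both accepted and misclassified. It is not the official fd-shifts code; the names `augrc`, `confidence`, and `errors` are chosen here for illustration, and details such as tie handling may differ from the PyTorch implementation in the linked repository.

```python
import numpy as np


def augrc(confidence: np.ndarray, errors: np.ndarray) -> float:
    """Illustrative AUGRC: area under the generalized risk-coverage curve.

    confidence -- confidence score of the CSF (higher = more likely correct)
    errors     -- 1 where the classifier's prediction is wrong, 0 where correct
    """
    n = len(confidence)
    # Accept samples in order of decreasing confidence.
    order = np.argsort(-confidence, kind="stable")
    sorted_errors = np.asarray(errors)[order]

    # At coverage k/n, the generalized risk is the fraction of ALL samples
    # that are both accepted (among the k most confident) and misclassified.
    coverage = np.arange(1, n + 1) / n
    generalized_risk = np.cumsum(sorted_errors) / n

    # Integrate the generalized risk over coverage (trapezoidal rule).
    # Note: ties in confidence are broken arbitrarily here; a careful
    # implementation would evaluate only at distinct confidence thresholds.
    return float(np.trapz(generalized_risk, coverage))


# Toy usage: 1000 samples with random confidences and a 20% error rate.
rng = np.random.default_rng(0)
conf = rng.random(1000)
err = (rng.random(1000) < 0.2).astype(int)
print(f"AUGRC: {augrc(conf, err):.4f}")
```

Since the computation reduces to a single sort and a cumulative sum, this is consistent with the sub-minute CPU evaluation time per test set reported in the table.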