Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Authors: Jeremias Traub, Till Bungert, Carsten Lüth, Michael Baumgartner, Klaus Maier-Hein, Lena Maier-Hein, Paul Jaeger

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets. |
| Researcher Affiliation | Academia | (1) German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Germany; (2) Helmholtz Imaging, DKFZ Heidelberg, Germany; (3) DKFZ Heidelberg, Division of Medical Image Computing (MIC), Germany; (4) DKFZ Heidelberg, Division of Intelligent Medical Systems (IMSY), Germany; (5) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, 69120 Heidelberg, Germany; (6) Faculty of Mathematics and Computer Science, University of Heidelberg, Germany; (7) National Center for Tumor Diseases (NCT) Heidelberg |
| Pseudocode | No | The paper includes mathematical derivations and formulas but does not feature explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code for reproducing our results and a PyTorch implementation of the AUGRC are available at: https://github.com/IML-DKFZ/fd-shifts. |
| Open Datasets | Yes | We evaluate SC methods on the FD-Shifts benchmark [Jäger et al., 2023], which considers a broad range of datasets and failure sources through various distribution shifts: SVHN [Netzer et al., 2011], CIFAR-10, and CIFAR-100 [Krizhevsky] are evaluated on semantic and non-semantic new-class shifts in a rotating fashion including Tiny ImageNet [Le and Yang, 2015]. |
| Dataset Splits | Yes | Based on the performance on the validation set, we choose the Deep Gambler reward hyperparameter and whether to use dropout (for the non-MCD-based CSFs). |
| Hardware Specification | Yes | Our method ranking study focuses on the evaluation of CSF performance based on the existing FD-Shifts benchmark, hence we required no GPUs for the analysis in Section 4. As both AURC and AUGRC can be computed efficiently, (CPU) evaluation time for a single test set is less than a minute; evaluation on 500 bootstrap samples on a single CPU core takes around 3 hours. *(See the AUGRC sketch below the table.)* |
| Software Dependencies | No | The paper mentions "a PyTorch implementation" of AUGRC but does not specify the version number for PyTorch or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The experiments are based on the same hyperparameters as reported in Table 4 in Jäger et al. [2023]. Based on the performance on the validation set, we choose the Deep Gambler reward hyperparameter and whether to use dropout (for the non-MCD-based CSFs). For the former, we select from [2.2, 3, 6, 10] on Wilds-Camelyon-17, CIFAR-10, and SVHN, from [2.2, 3, 6, 10, 15] on iWildCam and BREEDS-Entity-13, and from [2.2, 3, 6, 10, 12, 15, 20] on CIFAR-100. |
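
The Hardware Specification row cites the paper's claim that both AURC and AUGRC are cheap to compute on CPU. The sketch below is a minimal, illustrative NumPy implementation of the AUGRC (area under the generalized risk-coverage curve), assuming the metric is the integral over coverage of the generalized risk, i.e. the fraction of all samples that are both accepted and misclassified. It is not the official fd-shifts code; the names `augrc`, `confidence`, and `errors` are chosen here for illustration, and details such as tie handling may differ from the PyTorch implementation in the linked repository.

```python
import numpy as np


def augrc(confidence: np.ndarray, errors: np.ndarray) -> float:
    """Illustrative AUGRC: area under the generalized risk-coverage curve.

    confidence -- confidence score of the CSF (higher = more likely correct)
    errors     -- 1 where the classifier's prediction is wrong, 0 where correct
    """
    n = len(confidence)
    # Accept samples in order of decreasing confidence.
    order = np.argsort(-confidence, kind="stable")
    sorted_errors = np.asarray(errors)[order]

    # At coverage k/n, the generalized risk is the fraction of ALL samples
    # that are both accepted (among the k most confident) and misclassified.
    coverage = np.arange(1, n + 1) / n
    generalized_risk = np.cumsum(sorted_errors) / n

    # Integrate the generalized risk over coverage (trapezoidal rule).
    # Note: ties in confidence are broken arbitrarily here; a careful
    # implementation would evaluate only at distinct confidence thresholds.
    return float(np.trapz(generalized_risk, coverage))


# Toy usage: 1000 samples with random confidences and a 20% error rate.
rng = np.random.default_rng(0)
conf = rng.random(1000)
err = (rng.random(1000) < 0.2).astype(int)
print(f"AUGRC: {augrc(conf, err):.4f}")
```

Since the computation reduces to a single sort and a cumulative sum, this is consistent with the sub-minute CPU evaluation time per test set reported in the table.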