Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed
Authors: Mélanie Bernhardt, Fabio De Sousa Ribeiro, Ben Glocker
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. |
| Researcher Affiliation | Academia | Mélanie Bernhardt EMAIL Imperial College London, UK Fabio De Sousa Ribeiro EMAIL Imperial College London, UK Ben Glocker EMAIL Imperial College London, UK |
| Pseudocode | No | The paper describes methods and processes in detail, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like procedural steps outside of regular paragraph text. |
| Open Source Code | Yes | Code available at: https://github.com/melanibe/failure_detection_benchmark |
| Open Datasets | Yes | First, we evaluate confidence scores on 3 tasks from MedMNIST-v2 (Yang et al., 2021) (all with test set size above 5,000 images). PathMNIST (Kather et al., 2019) consists of non-overlapping patches from histology slides annotated with 9 colon disease classes (with train and test splits from different clinical centers). TissueMNIST (Ljosa et al., 2012) is comprised of kidney cortex cell microscope images, classified into 8 classes of cell subtypes. OrganAMNIST (Xu et al., 2019) is comprised of center slices from abdominal CT images in axial view, classified by organ type (11 classes). Secondly, we evaluate on three more challenging medical imaging tasks with higher-resolution images, using data from the RSNA Pneumonia Detection Challenge (Shih et al., 2019), the Breast Ultrasound Image Dataset (Al-Dhabyani et al., 2020) (BUSI) and the EyePACS Diabetic Retinopathy Detection Challenge dataset (https://www.kaggle.com/c/diabetic-retinopathy-detection). The EyePACS dataset is comprised of high-resolution retina images depicting various stages of diabetic retinopathy. The original labels form a 5-class classification task; here we follow the approach of Band et al. (2021); Leibig et al. (2017) and binarise the task to distinguish sight-threatening diabetic retinopathy (original classes {2, 3, 4}) from non-sight-threatening diabetic retinopathy (original classes {0, 1}). This dataset consists of 35,126 training, 10,906 validation and 42,670 test images. |
| Dataset Splits | Yes | For MedMNIST-v2 tasks: "we use the original train-val-test splits." For BUSI and RSNA: "For both datasets, we randomly split the data in 70%-10%-20% train-val-test splits." For EyePACS: "This dataset consists of 35,126 training, 10,906 validation and 42,670 test images." |
| Hardware Specification | No | The paper specifies model architectures like ResNet-18, ResNet-50, DenseNet-121, and Wide ResNet-50, but does not mention any specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions a "Python implementation" and discusses various deep learning methods that imply the use of libraries like PyTorch or TensorFlow, but it does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). |
| Experiment Setup | Yes | All models are trained with an additional dropout layer after each weight layer to be able to run the MC-dropout comparison (with dropout probability p=0.1 for all experiments, based on validation performance). For all models the learning rate is divided by 10 after 10 epochs with no decrease in validation loss. We stop training after 15 consecutive epochs with no decrease in validation loss and choose the model with the lowest validation loss for testing. For both binary tasks (RSNA and EyePACS) the classification threshold was chosen such that the FPR was 20% on the validation set. For MC-dropout and SWAG, we set the number of inference passes to 10 (we did not find any notable improvement when increasing the number of inference passes). For the Laplace method, we apply the Laplace approximation on the last layer weights using a Kronecker approximation of the Hessian, as per the recommended parameters in Daxberger et al. (2021). For SWAG (Maddox et al., 2019), we tuned the learning rate schedule on the validation set, and for DUQ, we followed Van Amersfoort et al. (2020) for tuning of hyperparameters. |
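The Dataset Splits row quotes a random 70%-10%-20% train-val-test split for BUSI and RSNA. A minimal sketch of such a split (the function name and seed handling are our own assumptions, not the authors' code):

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.1, seed=0):
    """Randomly partition range(n) into train/val/test index lists
    (70%-10%-20% by default), as described for BUSI and RSNA."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded shuffle for a reproducible split
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy usage on a dataset of 100 samples.
train_idx, val_idx, test_idx = split_indices(100)
```

In practice one would split at the patient level for medical data to avoid leakage across sets; the sketch above only shows the index arithmetic.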
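The Experiment Setup row states that for the binary tasks the classification threshold was chosen such that the FPR was 20% on the validation set. A dependency-free sketch of that selection (the helper name and toy data are ours, not the authors' implementation):

```python
def threshold_at_fpr(labels, scores, target_fpr=0.2):
    """Scan candidate thresholds (the unique scores, descending) and return
    the one whose validation false-positive rate is closest to target_fpr.
    A sample is predicted positive when its score >= threshold."""
    n_neg = sum(1 for y in labels if y == 0)
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        gap = abs(fp / n_neg - target_fpr)
        if gap < best_gap:
            best_gap, best_t = gap, t
    return best_t

# Toy usage: 5 negatives, 5 positives with hypothetical model scores.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.5, 0.8, 0.9, 0.95]
t = threshold_at_fpr(labels, scores)
```

The threshold is then frozen and applied unchanged to the test set.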
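The same row sets the number of MC-dropout inference passes to 10. The standard way to turn stochastic passes into a confidence score is to average the per-pass softmax outputs and take the maximum class probability; a framework-agnostic NumPy sketch under that assumption (function name is ours):

```python
import numpy as np

def mc_dropout_confidence(pass_probs):
    """pass_probs: array of shape (T, N, C) holding softmax outputs from
    T stochastic forward passes (dropout active) over N samples, C classes.
    Returns per-sample predicted class and confidence (max of the mean
    predictive distribution)."""
    mean_probs = np.asarray(pass_probs).mean(axis=0)  # (N, C) predictive mean
    preds = mean_probs.argmax(axis=1)                 # predicted class per sample
    conf = mean_probs.max(axis=1)                     # confidence score per sample
    return preds, conf

# Toy usage: T=10 passes, N=3 samples, C=2 classes, simulated softmax outputs.
rng = np.random.default_rng(0)
probs = rng.dirichlet([2.0, 1.0], size=(10, 3))       # shape (10, 3, 2)
preds, conf = mc_dropout_confidence(probs)
```

In a deep learning framework the T passes would come from running the trained network with dropout layers kept in training mode at inference time.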