Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed
Authors: Mélanie Bernhardt, Fabio De Sousa Ribeiro, Ben Glocker
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. |
| Researcher Affiliation | Academia | Mélanie Bernhardt EMAIL Imperial College London, UK Fabio De Sousa Ribeiro EMAIL Imperial College London, UK Ben Glocker EMAIL Imperial College London, UK |
| Pseudocode | No | The paper describes methods and processes in detail, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like procedural steps outside of regular paragraph text. |
| Open Source Code | Yes | Code available at: https://github.com/melanibe/failure_detection_benchmark |
| Open Datasets | Yes | First, we evaluate confidence scores on 3 tasks from MedMNIST-v2 (Yang et al., 2021) (all with test set size above 5,000 images). PathMNIST (Kather et al., 2019) consists of non-overlapping patches from histology slides annotated with 9 colon disease classes (with train and test splits from different clinical centers). TissueMNIST (Ljosa et al., 2012) is comprised of kidney cortex cell microscope images, classified into 8 classes of cell subtypes. OrganAMNIST (Xu et al., 2019) is comprised of center slices from abdominal CT images in axial view, classified by organ type (11 classes). Secondly, we evaluate on three more challenging medical imaging tasks with higher-resolution images, using data from the RSNA Pneumonia Detection Challenge (Shih et al., 2019), the Breast Ultrasound Image Dataset (Al-Dhabyani et al., 2020) (BUSI) and the EyePACS Diabetic Retinopathy Detection Challenge dataset (https://www.kaggle.com/c/diabetic-retinopathy-detection). The EyePACS dataset is comprised of high-resolution retina images depicting various stages of diabetic retinopathy. The original labels form a 5-class classification task; here we follow the approach of Band et al. (2021); Leibig et al. (2017) and binarise the task to distinguish sight-threatening diabetic retinopathy (original classes {2, 3, 4}) from non-sight-threatening diabetic retinopathy (original classes {0, 1}). This dataset consists of 35,126 training, 10,906 validation and 42,670 test images. |
| Dataset Splits | Yes | For MedMNIST-v2 tasks: "we use the original train-val-test splits." For BUSI and RSNA: "For both datasets, we randomly split the data in 70%-10%-20% train-val-test splits." For EyePACS: "This dataset consists of 35,126 training, 10,906 validation and 42,670 test images." |
| Hardware Specification | No | The paper specifies model architectures like ResNet-18, ResNet-50, DenseNet-121, and Wide ResNet-50, but does not mention any specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions a "Python implementation" and discusses various deep learning methods that imply the use of libraries like PyTorch or TensorFlow, but it does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). |
| Experiment Setup | Yes | All models are trained with an additional dropout layer after each weight layer to be able to run the MC-dropout comparison (with dropout probability p=0.1 for all experiments, based on validation performance). For all models the learning rate is divided by 10 after 10 epochs with no decrease in validation loss. We stop training after 15 consecutive epochs with no decrease in validation loss and choose the model with the lowest validation loss for testing. For both binary tasks (RSNA and EyePACS) the classification threshold was chosen such that the FPR was 20% on the validation set. For MC-dropout and SWAG, we set the number of inference passes to 10 (we did not find any notable improvement when increasing the number of inference passes). For the Laplace method, we apply the Laplace approximation on the last layer weights using a Kronecker approximation of the Hessian, as per the recommended parameters in Daxberger et al. (2021). For SWAG (Maddox et al., 2019), we tuned the learning rate schedule on the validation set, and for DUQ, we followed Van Amersfoort et al. (2020) for tuning of hyperparameters. |
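The Dataset Splits row quotes a random 70%-10%-20% train-val-test split for BUSI and RSNA. A minimal sketch of such a split (the function name and seed handling are our own assumptions, not the authors' code):

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.1, seed=0):
    """Randomly partition range(n) into train/val/test index lists
    (70%-10%-20% by default), as described for BUSI and RSNA."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded shuffle for a reproducible split
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy usage on a dataset of 100 samples.
train_idx, val_idx, test_idx = split_indices(100)
```

In practice one would split at the patient level for medical data to avoid leakage across sets; the sketch above only shows the index arithmetic.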
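The Experiment Setup row states that for the binary tasks the classification threshold was chosen such that the FPR was 20% on the validation set. A dependency-free sketch of that selection (the helper name and toy data are ours, not the authors' implementation):

```python
def threshold_at_fpr(labels, scores, target_fpr=0.2):
    """Scan candidate thresholds (the unique scores, descending) and return
    the one whose validation false-positive rate is closest to target_fpr.
    A sample is predicted positive when its score >= threshold."""
    n_neg = sum(1 for y in labels if y == 0)
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        gap = abs(fp / n_neg - target_fpr)
        if gap < best_gap:
            best_gap, best_t = gap, t
    return best_t

# Toy usage: 5 negatives, 5 positives with hypothetical model scores.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.5, 0.8, 0.9, 0.95]
t = threshold_at_fpr(labels, scores)
```

The threshold is then frozen and applied unchanged to the test set.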
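The same row sets the number of MC-dropout inference passes to 10. The standard way to turn stochastic passes into a confidence score is to average the per-pass softmax outputs and take the maximum class probability; a framework-agnostic NumPy sketch under that assumption (function name is ours):

```python
import numpy as np

def mc_dropout_confidence(pass_probs):
    """pass_probs: array of shape (T, N, C) holding softmax outputs from
    T stochastic forward passes (dropout active) over N samples, C classes.
    Returns per-sample predicted class and confidence (max of the mean
    predictive distribution)."""
    mean_probs = np.asarray(pass_probs).mean(axis=0)  # (N, C) predictive mean
    preds = mean_probs.argmax(axis=1)                 # predicted class per sample
    conf = mean_probs.max(axis=1)                     # confidence score per sample
    return preds, conf

# Toy usage: T=10 passes, N=3 samples, C=2 classes, simulated softmax outputs.
rng = np.random.default_rng(0)
probs = rng.dirichlet([2.0, 1.0], size=(10, 3))       # shape (10, 3, 2)
preds, conf = mc_dropout_confidence(probs)
```

In a deep learning framework the T passes would come from running the trained network with dropout layers kept in training mode at inference time.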