A case for reframing automated medical image classification as segmentation
Authors: Sarah Hooper, Mayee Chen, Khaled Saab, Kush Bhatia, Curtis Langlotz, Christopher Ré
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then implement methods for using segmentation models to classify medical images, which we call segmentation-for-classification, and compare these methods against traditional classification on three retrospective datasets (n=2,018–19,237). |
| Researcher Affiliation | Academia | Sarah M. Hooper, Electrical Engineering, Stanford University; Mayee F. Chen, Computer Science, Stanford University; Khaled Saab, Electrical Engineering, Stanford University; Kush Bhatia, Computer Science, Stanford University; Curtis Langlotz, Radiology and Biomedical Data Science, Stanford University; Christopher Ré, Computer Science, Stanford University |
| Pseudocode | Yes | Specifically, in Algorithm 1, we give the algorithm we use to compute a binary, image-level label from a probabilistic segmentation mask. In Algorithm 2 we provide the method we use to compute a probabilistic image-level label from a probabilistic segmentation mask. (An illustrative sketch of this mask-to-label step appears first below the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or a direct link to a source-code repository for the methodology described. |
| Open Datasets | Yes | We also evaluate three medical imaging datasets: CANDID, in which we aim to classify pneumothorax in chest x-rays (n=19,237) [29]; ISIC, in which we aim to classify melanoma from skin lesion photographs (n=2,750) [30]; and SPINE, in which we aim to classify cervical fractures in CT scans (n=2,018, RSNA 2022 Cervical Spine Fracture Detection Challenge). |
| Dataset Splits | Yes | We split this dataset randomly into 60% training images, 20% validation images, and 20% test images. We use the splits provided by the ISIC challenge organizers, resulting in 2000 training images, 150 validation images, and 600 test images. We randomly split this dataset into 60% training, 20% validation, and 20% test. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., specific GPU or CPU models, memory) used for running the experiments. |
| Software Dependencies | Yes | We conduct our experiments using PyTorch Lightning (PyTorch version 1.9.0, Lightning version 1.5.10) [60, 61]. |
| Experiment Setup | Yes | We provide additional training details and information on hyperparameter tuning in Appendix A4.2 and A5.2... We train each network with an Adam optimizer and a learning rate of 1e-4, tuned from [1e-6, 1e-5, 1e-4, 1e-3]. We evaluated learning rates [1e-6, 1e-5, 1e-4, 1e-3, 1e-2] for each summarizing function and chose the learning rate that maximized the validation AUROC. (The last sketch below the table illustrates this selection loop.) |
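
The Pseudocode row describes two summarizing procedures: Algorithm 1 turns a probabilistic segmentation mask into a binary image-level label, and Algorithm 2 turns it into a probabilistic image-level label. The authors' exact rules are not reproduced in this summary, so the following is only a minimal sketch that assumes a pixel-count threshold for the binary label and a max-over-pixels summary for the probabilistic label; the names and thresholds are illustrative, not taken from the paper.

```python
import numpy as np

def binary_image_label(prob_mask: np.ndarray,
                       pixel_threshold: float = 0.5,
                       min_positive_pixels: int = 1) -> int:
    """Binarize the pixel probabilities, then call the image positive if
    enough pixels clear the threshold (an assumed rule, not Algorithm 1 verbatim)."""
    positive_pixels = int((prob_mask >= pixel_threshold).sum())
    return int(positive_pixels >= min_positive_pixels)

def probabilistic_image_label(prob_mask: np.ndarray) -> float:
    """Summarize pixel probabilities into one image-level probability;
    the max over pixels is one plausible summarizing function."""
    return float(prob_mask.max())

# A toy 4x4 mask with a single confident foreground pixel.
mask = np.zeros((4, 4))
mask[2, 1] = 0.9
print(binary_image_label(mask))         # -> 1
print(probabilistic_image_label(mask))  # -> 0.9
```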
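For the Dataset Splits row, the quoted 60/20/20 random split can be reproduced in spirit with a simple index permutation. The seed, any per-patient grouping, and the exact indices used by the authors are not stated, so this is an assumed illustration only.

```python
import numpy as np

def random_split(n_items: int, seed: int = 0):
    """Shuffle item indices and cut them into 60% train, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)  # seed is an assumption, not from the paper
    idx = rng.permutation(n_items)
    n_train, n_val = int(0.6 * n_items), int(0.2 * n_items)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(19237)  # e.g., the CANDID dataset size
print(len(train_idx), len(val_idx), len(test_idx))
```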
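For the Experiment Setup row, the quoted procedure trains with Adam and picks the learning rate that maximizes validation AUROC. The sketch below follows that selection loop, but on synthetic tensors with a deliberately tiny placeholder model and scikit-learn's `roc_auc_score`; the paper's actual architectures, datasets, and training schedules are not reproduced here.

```python
import torch
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data; the real chest x-ray, skin lesion, and CT images are not bundled here.
torch.manual_seed(0)
x_train, y_train = torch.randn(64, 1, 32, 32), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(32, 1, 32, 32), torch.randint(0, 2, (32,))

def train_and_score(lr: float, epochs: int = 3) -> float:
    """Train a tiny placeholder classifier with Adam at the given learning rate
    and return its validation AUROC."""
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_train).squeeze(1), y_train.float())
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        scores = torch.sigmoid(model(x_val).squeeze(1))
    return roc_auc_score(y_val.numpy(), scores.numpy())

# Sweep the learning rates quoted in the paper and keep the one with the best validation AUROC.
results = {lr: train_and_score(lr) for lr in [1e-6, 1e-5, 1e-4, 1e-3]}
best_lr = max(results, key=results.get)
print(best_lr, results[best_lr])
```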