Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Class-Discriminative Attention Maps for Vision Transformers
Authors: Lennart Brocki, Jakub Binda, Neo Christopher Chung
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our quantitative benchmarks include correctness, compactness, and class sensitivity, in comparison to 7 other importance estimators. Vanilla, Smooth, and Integrated CDAM excel across all three benchmarks. In particular, our results suggest that existing importance estimators may not provide sufficient class-sensitivity. We demonstrate the utility of CDAM in medical images by training and explaining malignancy and biomarker prediction models based on lung Computed Tomography (CT) scans. |
| Researcher Affiliation | Academia | Lennart Brocki EMAIL Jakub Binda EMAIL Neo Christopher Chung EMAIL Institute of Informatics, University of Warsaw |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but no explicitly labeled pseudocode block or algorithm section is present. |
| Open Source Code | Yes | Code available: https://github.com/lenbrocki/CDAM |
| Open Datasets | Yes | We conduct several quantitative evaluations focusing on correctness, class sensitivity, and compactness. By using the ImageNet samples (Deng et al., 2009) with multiple objects (Beyer et al., 2020) and applying importance estimators for different classes, we quantify the level of class-discrimination. ... Lastly, we have applied CDAM on a ViT fine-tuned on the Lung Image Database Consortium image collection (LIDC) (Armato III et al., 2011). |
| Dataset Splits | Yes | Training, validation, and test sets (in the ratios of 0.7225, 0.1275, 0.15) were stratified by and balanced according to these labels, e.g., benign and malignant. ... The LIDC dataset was split into 5 folds and stratified according to malignancy status. |
| Hardware Specification | No | This research was carried out with the support of the Interdisciplinary Centre for Mathematical and Computational Modelling University of Warsaw (ICM UW) under computational allocation no GDM-3540; the IDUB program (Excellence Initiative Research University); the NVIDIA Corporation's Academic Hardware Grant; and the Google Cloud Research Innovators program. While NVIDIA hardware and Google Cloud are mentioned, specific GPU/CPU models or detailed specifications are not provided in the paper. |
| Software Dependencies | No | We use random resized cropping and horizontal flipping with PyTorch default arguments as augmentation, Adam optimizer with learning rate 3 × 10⁻⁴, batch size of 128, and train for 10 epochs. The paper mentions PyTorch and the Adam optimizer but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use random resized cropping and horizontal flipping with PyTorch default arguments as augmentation, Adam optimizer with learning rate 3 × 10⁻⁴, batch size of 128, and train for 10 epochs. The parameters of the ViT backbone are frozen during training, so only the classifier head is trainable. ... In a parameter sweep, we varied the number of trainable layers (10–50) and dropout rates (0.0–0.09), where the learning rate was exponentially decaying (α = 0.0003 and β = 0.95). The best accuracy on the test set of 0.85 was obtained with 50 trainable layers and a dropout rate of 0.0031. |
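The reported label-stratified 0.7225 / 0.1275 / 0.15 train/validation/test split can be illustrated with a minimal sketch. This is not the authors' code: the `stratified_split` function and the example sample counts are hypothetical, assuming only that the split is performed per label group in the stated ratios.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, ratios=(0.7225, 0.1275, 0.15), seed=0):
    """Split samples into train/val/test sets, stratified by label.

    Each label group is shuffled and partitioned separately, so every
    split preserves the overall label balance (e.g. benign vs. malignant).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_val = round(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Hypothetical example: 400 benign (label 0) and 400 malignant (label 1) scans.
labels = [0] * 400 + [1] * 400
samples = list(range(800))
train, val, test = stratified_split(samples, labels)
print(len(train), len(val), len(test))  # → 578 102 120
```

Note that 0.7225 + 0.1275 = 0.85, so the split is equivalent to first holding out 15% for testing and then splitting the remainder 85/15 into training and validation.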