Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Authors: Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao CHEN, Xiang Li, Ming-Ming Cheng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We use PANDA [6], TCGA-BRCA, and TCGA-NSCLC to evaluate the performance in cancer grading and sub-typing tasks. For cancer prognosis, we use TCGA-LUAD, TCGA-BRCA, TCGA-BLCA to evaluate performance on the survival analysis task. For external validation, we use CPTAC-LUAD, CPTAC-LUSC to evaluate the generalization ability. For cancer grading, we evaluate model performance using top-1 accuracy (Acc.). And area under the ROC curve (AUC) is used for sub-typing. For survival analysis, we employ the concordance index (C-index) [20]. To ensure robust statistical evaluation, we conducted a 1000-time bootstrapping evaluation and report the mean and 95% confidence interval.
Researcher Affiliation	Academia	1Nankai International Advanced Research Institute (Shenzhen Futian) 2VCIP, School of Computer Science, Nankai University 3Huazhong University of Science and Technology 4The Hong Kong University of Science and Technology
Pseudocode	No	The paper describes its methodology (Section 3) through descriptive text and mathematical equations (e.g., Equation 1, 2, 3, 4, 5). It does not contain any clearly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code	Yes	The code is here.
Open Datasets	Yes	We use PANDA [6], TCGA-BRCA, and TCGA-NSCLC to evaluate the performance in cancer grading and sub-typing tasks. For cancer prognosis, we use TCGA-LUAD, TCGA-BRCA, TCGA-BLCA to evaluate performance on the survival analysis task. For external validation, we use CPTAC-LUAD, CPTAC-LUSC to evaluate the generalization ability... PANDA [6] (CC-BY-4.0) is a large-scale, multi-center dataset dedicated to prostate cancer detection and grading... The Non-Small Cell Lung Cancer (NSCLC) project of The Cancer Genome Atlas (TCGA)... The Breast Invasive Carcinoma (TCGA-BRCA) project... We supplemented the CAMELYON dataset (CC-BY-4.0) to evaluate qualitative and quantitative results of different methods. The dataset comprises CAMELYON-16 [3] and CAMELYON-17 [2]...
Dataset Splits	Yes	We randomly split the PANDA dataset into training, validation, and testing sets with a ratio of 7:1:2. Due to the limited data size, the remaining datasets are divided into training and testing sets with a ratio of 7:3.
Hardware Specification	Yes	Our E2E learning framework achieves significant performance improvements (e.g., +20% accuracy on PANDA) while maintaining computationally efficient (< 10 RTX3090 GPU hours on TCGA-BRCA).
Software Dependencies	No	The paper mentions using 'Adam [28] optimizer' and 'AdamW [37] optimizer' for training, but it does not specify version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other key software components used for implementation.
Experiment Setup	Yes	For cancer grading (PANDA), we employed an Adam [28] optimizer with a learning rate of $2 \times 10^{-4}$ and a weight decay of $1 \times 10^{-5}$, training for 200 epochs. For sub-typing (NSCLC, BRCA), we used an AdamW [37] optimizer with a learning rate of $8 \times 10^{-5}$ and no weight decay, training for 75 epochs. For survival analysis (LUAD, BLCA, BRCA), we utilized an AdamW optimizer with a learning rate of $8 \times 10^{-5}$ and a weight decay of $5 \times 10^{-2}$, training for 30 epochs. The learning rate was adjusted using the Cosine annealing strategy. During training, we applied simple geometric data augmentations such as flipping and Random Resized Crop. All experiments are conducted on 3090 GPUs. We adjusted the batch size based on the 24GB memory limit and the number of samples in different datasets.
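
The table above quotes three procedures that are easy to misread in prose: the 7:1:2 random split, the cosine-annealed learning rate, and the 1000-resample bootstrap that yields a mean and 95% interval. The following is a minimal standard-library Python sketch of each, not the authors' implementation. The function names (`split_indices`, `cosine_lr`, `bootstrap_accuracy`), the seeds, the per-epoch schedule with a minimum rate of zero, and the percentile-style interval are all assumptions made for illustration; the quoted excerpts do not specify them.

```python
import math
import random


def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle range(n) and cut it into consecutive parts proportional to ratios.

    Hypothetical helper: the seed and shuffling scheme are assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    total = sum(ratios)
    parts, start, cumulative = [], 0, 0
    for r in ratios:
        cumulative += r
        end = n * cumulative // total  # integer cut point for this part
        parts.append(indices[start:end])
        start = end
    return parts


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no warmup, no restarts).

    Assumes a per-epoch schedule decaying to min_lr; the source only says
    'Cosine annealing strategy'.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)


def bootstrap_accuracy(labels, preds, n_boot=1000, seed=0):
    """Mean and 95% percentile interval of top-1 accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        # Resample test cases with replacement, then score the resample.
        sample = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(labels[i] == preds[i] for i in sample) / n)
    scores.sort()
    lower = scores[int(0.025 * n_boot)]
    upper = scores[int(0.975 * n_boot) - 1]
    return sum(scores) / n_boot, (lower, upper)
```

For example, `split_indices(1000)` returns train, validation, and test index lists in a 7:1:2 ratio, `split_indices(1000, ratios=(7, 3))` gives the 7:3 train/test split quoted for the remaining datasets, and `cosine_lr(2e-4, epoch, 200)` traces the learning rate over the 200 PANDA epochs under the stated assumptions. The same bootstrap loop applies to AUC or C-index by swapping the scoring expression.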