Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification

Authors: Junjie Zhou, Wei Shao, Yagao Yue, Wei Mu, Peng Wan, Qi Zhu, Daoqiang Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments on three cancer cohorts derived from the cancer genome atlas (TCGA), and the experimental results indicate the advantage of MAPLE on few-shot pathology diagnosis tasks. Codes will be available at https://github.com/JJ-ZHOU-Code/MAPLE.
Researcher Affiliation	Academia	(1) The College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; (2) The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education; (3) The School of Engineering Medicine, Beihang University
Pseudocode	Yes	Algorithm 1 LLM-powered Prompt Construction
Open Source Code	No	Codes will be available at https://github.com/JJ-ZHOU-Code/MAPLE. The source code will be publicly available upon acceptance of the paper.
Open Datasets	Yes	Datasets. We evaluate MAPLE on three benchmark WSI datasets from The Cancer Genome Atlas (TCGA): TCGA-BRCA, TCGA-RCC, and TCGA-NSCLC. More details for the classification task on each cohort are provided in Appendix B.1. To simulate the few-shot learning scenario in clinical practice, we randomly sample K WSIs per class (K = 4, 8, 16 in our implementation). Footnote 2: https://portal.gdc.cancer.gov
Dataset Splits	Yes	To simulate the few-shot learning scenario in clinical practice, we randomly sample K WSIs per class (K = 4, 8, 16 in our implementation). We conduct five-fold cross-validation, and the mean and standard deviation are calculated from the results of all folds.
Hardware Specification	Yes	All experiments are conducted using PyTorch 2.0.1 and CUDA 11.7 on Python 3.8 with NVIDIA RTX 3090 GPUs.
Software Dependencies	Yes	All experiments are conducted using PyTorch 2.0.1 and CUDA 11.7 on Python 3.8 with NVIDIA RTX 3090 GPUs.
Experiment Setup	Yes	GPT-4 [2] is taken as the frozen large language model (LLM). For hyperparameter settings, the number of entities at each scale n_k is tuned from 4 to 20 with an interval of 4 (Section 4.1), while the number of neighbors for constructing the cross-scale entity graph, n_e = 7, is tuned from 1 to 13 with an interval of 2 (Section 4.4). Since WSIs are divided into different numbers of patches, we select the top r% of tumor-related patches for each WSI as the top-k patches, and we tune r from 0.1 to 1 with an interval of 0.2. Finally, the weighting parameter λ (Section 4.5) for combining entity-level and slide-level predictions is tuned from 0 to 1 with an interval of 0.1. The model is optimized using AdamW with a learning rate of 1 × 10^-4 and trained for up to 80 epochs with early stopping based on validation performance.
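
The search ranges quoted in the Experiment Setup row and the K-shot sampling quoted in the Dataset Splits row can be written out as a minimal Python sketch. This is an illustration only, not the authors' code: the names GRID and sample_k_shot, the seed argument, and the representation of slides as a flat list of class labels are all assumptions, and the excerpts do not say how the K-shot sampling interacts with the five-fold cross-validation.

```python
import random
from collections import defaultdict

# Search ranges as quoted in the Experiment Setup row.
# The dictionary keys are hypothetical names, not identifiers from the paper's code.
GRID = {
    "n_k": list(range(4, 21, 4)),                  # entities per scale: 4 to 20, step 4
    "n_e": list(range(1, 14, 2)),                  # graph neighbors: 1 to 13, step 2
    "r": [x / 10 for x in range(1, 11, 2)],        # patch fraction: from 0.1, step 0.2
    "lambda": [x / 10 for x in range(0, 11)],      # prediction weight: 0 to 1, step 0.1
}


def sample_k_shot(labels, k, seed=0):
    """Randomly pick k slide indices per class.

    `labels` is a list of class ids, one per slide (an assumed representation).
    Raises ValueError if a class has fewer than k slides.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    return {label: sorted(rng.sample(idxs, k)) for label, idxs in by_class.items()}


if __name__ == "__main__":
    for name, values in GRID.items():
        print(name, values)
    toy_labels = [0] * 20 + [1] * 20               # toy data: two classes, 20 slides each
    for k in (4, 8, 16):                           # K values quoted in the table
        shots = sample_k_shot(toy_labels, k)
        print(k, {c: len(v) for c, v in shots.items()})
```

Taken literally, a step of 0.2 starting at 0.1 yields r values of 0.1, 0.3, 0.5, 0.7, and 0.9, so the quoted upper bound of 1 is not reached; the excerpt does not say whether r = 1 was also evaluated.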