Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains

Authors: Marianne Rakic, Siyu Gai, Etienne Chollet, John Guttag, Adrian Dalca

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images. ... 4 Experiments
Researcher Affiliation	Academia	Marianne Rakic (MIT CSAIL, MGH); Siyu Gai (MIT CSAIL, MGH); Etienne Chollet (MIT CSAIL, MGH); John V. Guttag (MIT CSAIL); Adrian V. Dalca (MIT CSAIL, HMS, MGH)
Pseudocode	No	No explicit pseudocode or algorithm blocks are provided in the paper. Figure 3 is a "Method Schematic" diagram, not pseudocode or an algorithm block.
Open Source Code	No	Code will be made available upon acceptance.
Open Datasets	Yes	We train Pancakes on a large, diverse collection of biomedical data, and evaluate the multiprotocol segmentations produced on images from held-out datasets. We use Megamedical [16, 96, 123], which covers many biomedical domains [2, 3, 6, 11, 12, 14, 16, 19, 22, 29, 31, 32, 35, 37, 41, 42, 47, 49, 51, 52, 54, 61–67, 69, 71–73, 75, 78–82, 87, 93, 94, 98, 101, 103, 105, 106, 108, 110, 127, 129, 132, 133]. ... The data we use comes from a collection of public datasets. They are all cited accordingly in the paper but we do not have authorization to release the data ourselves.
Dataset Splits	Yes	The images within each dataset are also split into training, validation, and test splits. We train the models on the training split of the training datasets. We used the training split of the development datasets to monitor out-of-distribution capabilities. We report results on the test splits of both the development (in the supplemental material) and held-out datasets. We split the dataset based on subjects, and ensured that there was no train/validation/test subject cross-contamination.
Hardware Specification	Yes	Our model was trained using 45G of memory on a single node of an NVIDIA DGX A100 machine using two cores.
Software Dependencies	No	We use the AdamW optimizer [74] with a learning rate of 0.0001 [56]. We use PReLU activations and convolution layers with 32 features, kernel size 3 and stride 1.
Experiment Setup	Yes	For function $f_{\theta_f}(\cdot)$, we use a UNet-like architecture, with convolutional layers of 32 features followed by PReLU activation [39]. The function $h_{\theta_h}(\cdot)$ is a series of convolution layers with skip connections. We use the Softmax function across the K dimension to obtain multi-label segmentation maps, with non-overlapping labels. ... During training, we sample the maximum number of labels K, the number of protocols M, and the set size S uniformly from a fixed range: $K \in [5, 40]$, $M \in [5, 15]$, $S \in [2, 5]$. ... We use the AdamW optimizer [74] with a learning rate of 0.0001 [56].
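
Two of the rows above describe procedures that are easy to misread: the subject-level dataset split and the per-step sampling of K, M, and S. The following is a minimal sketch of both, not the authors' code (which the entry above marks as unreleased). It assumes the quoted ranges are inclusive integer ranges; the function names, the (subject_id, image_id) record layout, and the 60/20/20 split fractions are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Ranges quoted in the Experiment Setup row. Treating them as inclusive
# integer ranges is an assumption; the source does not say.
K_RANGE = (5, 40)  # maximum number of labels
M_RANGE = (5, 15)  # number of protocols
S_RANGE = (2, 5)   # set size


def sample_training_config(rng):
    """Draw K, M and S independently and uniformly from their fixed ranges."""
    return {
        "K": rng.randint(*K_RANGE),
        "M": rng.randint(*M_RANGE),
        "S": rng.randint(*S_RANGE),
    }


def split_by_subject(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole subjects to train/val/test so no subject spans two splits.

    `records` is a list of (subject_id, image_id) pairs. The fractions are
    hypothetical; the source does not report the split proportions.
    """
    by_subject = defaultdict(list)
    for subject_id, image_id in records:
        by_subject[subject_id].append(image_id)

    # Shuffle subjects (not images) with a fixed seed for repeatability.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_train = int(fractions[0] * len(subjects))
    n_val = int(fractions[1] * len(subjects))
    groups = {
        "train": subjects[:n_train],
        "val": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    # Expand each subject group back into its (subject_id, image_id) records.
    return {
        name: [(s, img) for s in subs for img in by_subject[s]]
        for name, subs in groups.items()
    }
```

Splitting on subject identifiers rather than on individual images is what rules out the train/validation/test cross-contamination mentioned in the Dataset Splits row: every image from a given subject lands in exactly one split.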