Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains
Authors: Marianne Rakic, Siyu Gai, Etienne Chollet, John Guttag, Adrian Dalca
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images. ... 4 Experiments |
| Researcher Affiliation | Academia | Marianne Rakic (MIT CSAIL, MGH); Siyu Gai (MIT CSAIL, MGH); Etienne Chollet (MIT CSAIL, MGH); John V. Guttag (MIT CSAIL); Adrian V. Dalca (MIT CSAIL, HMS, MGH) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. Figure 3 is a "Method Schematic" diagram, not a pseudocode or algorithm. |
| Open Source Code | No | Code will be made available upon acceptance. |
| Open Datasets | Yes | We train Pancakes on a large, diverse collection of biomedical data, and evaluate the multiprotocol segmentations produced on images from held-out datasets. We use Megamedical [16, 96, 123], which covers many biomedical domains [2, 3, 6, 11, 12, 14, 16, 19, 22, 29, 31, 32, 35, 37, 41, 42, 47, 49, 51, 52, 54, 61–67, 69, 71–73, 75, 78–82, 87, 93, 94, 98, 101, 103, 105, 106, 108, 110, 127, 129, 132, 133]. ... The data we use comes from a collection of public datasets. They are all cited accordingly in the paper but we do not have authorization to release the data ourselves. |
| Dataset Splits | Yes | The images within each dataset are also split into training, validation, and test splits. We train the models on the training split of the training datasets. We used the training split of the development datasets to monitor out-of-distribution capabilities. We report results on the test splits of both the development (in the supplemental material) and held-out datasets. We split the dataset based on subjects, and ensured that there was no train/validation/test subject cross-contamination. |
| Hardware Specification | Yes | Our model was trained using 45G of memory on a single node of an NVIDIA DGX A100 machine using two cores. |
| Software Dependencies | No | We use the AdamW optimizer [74] with a learning rate of 0.0001 [56]. We use PReLU activations and convolution layers with 32 features, kernel size 3 and stride 1. |
| Experiment Setup | Yes | For function $f_{\theta_f}(\cdot)$, we use a UNet-like architecture, with convolutional layers of 32 features followed by PReLU activation [39]. The function $h_{\theta_h}(\cdot)$ is a series of convolution layers with skip connections. We use the softmax function across the K dimension to obtain multi-label segmentation maps, with non-overlapping labels. ... During training, we sample the maximum number of labels K, the number of protocols M, and the set size S uniformly from a fixed range: $K \in [5, 40]$, $M \in [5, 15]$, $S \in [2, 5]$. ... We use the AdamW optimizer [74] with a learning rate of 0.0001 [56]. |
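
The Experiment Setup row describes two mechanical steps: drawing K, M, and S uniformly from fixed ranges at each training step, and applying a softmax across the K dimension so that labels do not overlap. The following is a minimal standard-library sketch of those two steps, not the authors' implementation. The function names are hypothetical, and treating the ranges as inclusive integer ranges is an assumption, since the quoted text does not say whether the endpoints are included.

```python
import math
import random

# Ranges quoted in the Experiment Setup row. Treating them as inclusive
# integer ranges is an assumption of this sketch.
K_RANGE = (5, 40)     # maximum number of labels
M_RANGE = (5, 15)     # number of protocols
S_RANGE = (2, 5)      # set size
LEARNING_RATE = 1e-4  # AdamW learning rate quoted in the table


def sample_training_config(rng):
    """Draw K, M and S uniformly from their ranges for one training step."""
    return {
        "K": rng.randint(*K_RANGE),
        "M": rng.randint(*M_RANGE),
        "S": rng.randint(*S_RANGE),
    }


def softmax_over_labels(logits):
    """Numerically stable softmax over the K logits of a single pixel."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label(logits):
    """Index of the most probable label, so each pixel gets exactly one label."""
    probs = softmax_over_labels(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

A call such as `sample_training_config(random.Random(0))` returns one dictionary of sampled values. In a real model the softmax would be applied over the label axis of a tensor rather than a Python list; the per-pixel version here only illustrates why the resulting labels cannot overlap.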