Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Stratify or Die: Rethinking Data Splits in Image Segmentation

Authors: Naga Venkata Sai Jitin Jami, Thomas Altstidl, Jonas Mueller, Jindong Li, Dario Zanca, Bjoern Eskofier, Heike Leutheuser

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate IPS, WDES, and random splitting across five benchmark segmentation datasets. Our evaluation begins by analyzing the consistency of folds generated through repeated experiments. We further assess the impact of targeted stratification versus random splitting by examining the standard deviation of accuracy, F1-score, and Intersection over Union (IoU) across 10-fold cross-validation. A lower deviation indicates a more reliable assessment of model performance. Empirical results show that WDES consistently achieves the highest quality splits.
Researcher Affiliation	Academia	1 Machine Learning and Data Analytics Lab, FAU Erlangen-Nürnberg, Germany; 2 Ambient Assisted Living and Medical Assistance Systems, Department of Computer Science, University of Bayreuth, Germany
Pseudocode	Yes	Appendix B Iterative Pixel Stratification Algorithm [...] Algorithm 1 Iterative Pixel Stratification [...] Appendix C Wasserstein-Driven Evolutionary Stratification Algorithm [...] Algorithm 2 Wasserstein-Driven Evolutionary Stratification (WDES)
Open Source Code	Yes	Implementation available at: https://github.com/jitinjami/SemanticStratification
Open Datasets	Yes	We select datasets spanning four major application domains: autonomous driving/street scenes, medical images, satellite imagery, and general-purpose datasets. For autonomous driving, we use Cityscapes [21] and CamVid [18]. In the medical domain, we consider EndoVis2018, a robotic scene segmentation dataset from MICCAI 2018 [19]. For satellite imagery, we use the LoveDA dataset [20], which focuses on land-cover segmentation in urban and rural locations. For general-purpose segmentation tasks, we include Pascal VOC 2012 [17].
Dataset Splits	Yes	Following this, we perform 10-fold cross-validation tests with random splitting and the targeted stratification strategies.
Hardware Specification	Yes	All experiments are conducted on a single Nvidia A100 graphics card with 40GB of VRAM, without distributed training, ensuring consistent and comparable results across the different stratification methods. [...] We evaluate the runtime performance of our stratification methods on an Apple MacBook Pro with an M3 Pro processor and 36 GB of memory.
Software Dependencies	No	We conduct a comparative analysis of three stratification strategies: a) random sampling using KFold from the scikit-learn library [34], b) Iterative Pixel Stratification, and c) Wasserstein-Driven Evolutionary Stratification implemented using the deap library [35]. For every fold, we train a UNet [6] with a resnet34 encoder from [36] for 50 epochs to perform segmentation on the images.
Experiment Setup	Yes	For every fold, we train a UNet [6] with a resnet34 encoder from [36] for 50 epochs to perform segmentation on the images. The training process employs a learning rate of 2e-4 and utilizes the Adam optimizer with Dice Loss. For Pascal VOC, CELoss is used instead, with a learning rate of 1e-4 and trained for 100 epochs. [...] The applicable parameters used for WDES are outlined in Appendix D.
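
The evaluation protocol quoted in the table (10-fold cross-validation, with the standard deviation of a metric across folds as the reliability measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses KFold from scikit-learn for the random baseline, whereas the stand-in below uses only the Python standard library. The function names `random_k_fold` and `fold_spread`, the sample count, the seed, and the per-fold IoU values are all hypothetical, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def random_k_fold(n_samples, k=10, seed=0):
    """Shuffle sample indices and cut them into k nearly equal test folds.

    Hypothetical stand-in for a random K-fold splitter; the paper itself
    uses scikit-learn's KFold for its random baseline.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def fold_spread(scores):
    """Return the mean and sample standard deviation of per-fold scores.

    Assumption: sample (n - 1) standard deviation; the source does not
    say which variant the authors report.
    """
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical dataset size and seed, for illustration only.
folds = random_k_fold(n_samples=103, k=10, seed=42)

# Made-up per-fold IoU values standing in for real evaluation results.
iou_per_fold = [0.71, 0.69, 0.74, 0.66, 0.72, 0.70, 0.68, 0.73, 0.67, 0.71]
mean_iou, std_iou = fold_spread(iou_per_fold)
print(f"IoU over {len(folds)} folds: mean={mean_iou:.3f}, std={std_iou:.3f}")
```

Under this protocol, a splitting strategy that yields a smaller standard deviation across folds on the same dataset and model would be read as giving a more reliable performance estimate.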