Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Probabilistic U-Net for Segmentation of Ambiguous Images
Authors: Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus Maier-Hein, S. M. Ali Eslami, Danilo Jimenez Rezende, Olaf Ronneberger
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on a lung abnormalities segmentation task and on a Cityscapes segmentation task that our model reproduces the possible segmentation variants as well as the frequencies with which they occur, doing so significantly better than published approaches. |
| Researcher Affiliation | Collaboration | ¹DeepMind, London, UK; ²Division of Medical Image Computing, German Cancer Research Center, Heidelberg, Germany |
| Pseudocode | No | The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures. |
| Open Source Code | Yes | An open source re-implementation of our approach can be found at https://github.com/SimonKohl/probabilistic_unet. |
| Open Datasets | Yes | Here we consider two datasets: The LIDC-IDRI dataset [32, 33, 34] which contains 4 annotations per input, and the Cityscapes dataset [35] |
| Dataset Splits | Yes | For our experiments we split this dataset into a training set composed of 722 patients, a validation set composed of 144 patients, and a test set composed of the remaining 144 patients. ... and split off 274 images (corresponding to the 3 cities of Darmstadt, Mönchengladbach and Ulm) from the official training set as our internal validation set. |
| Hardware Specification | Yes | For our experiments we used 8 Tesla P100 GPUs. |
| Software Dependencies | No | The paper mentions using the "Adam optimizer [37]" but does not specify versions for programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries. |
| Experiment Setup | Yes | We train for 1500 epochs... using the Adam optimizer [37] with default parameters and a learning rate of 1e-4. We decay the learning rate by a factor of 10 at epochs 1000 and 1250. A batch size of 20 is used... |
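
The learning-rate schedule quoted in the Experiment Setup row can be sketched as a piecewise-constant function. This is only an illustration of the quoted numbers (base rate 1e-4, divided by 10 at epochs 1000 and 1250, 1500 epochs in total), not the authors' training code. The function name `learning_rate`, the zero-based epoch indexing, and the choice to apply each drop starting at the milestone epoch itself are assumptions made for this sketch.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(1000, 1250), factor=10.0):
    """Piecewise-constant schedule (hypothetical helper, not from the paper's code).

    Assumes zero-based epochs and that each decay takes effect at the
    milestone epoch itself; the quoted text does not specify either detail.
    """
    # Count how many milestones have been reached by this epoch.
    drops = sum(1 for m in milestones if epoch >= m)
    # Divide the base rate by `factor` once per milestone reached.
    return base_lr / (factor ** drops)


if __name__ == "__main__":
    # Print the rate on either side of each milestone and at the final epoch.
    for epoch in (0, 999, 1000, 1249, 1250, 1499):
        print(epoch, learning_rate(epoch))
```

Under these assumptions the rate stays at 1e-4 for epochs 0 to 999, drops to about 1e-5 for epochs 1000 to 1249, and to about 1e-6 for the remaining epochs. Deep learning frameworks typically provide an equivalent built-in multi-step scheduler; the paper does not state which framework or version was used.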