Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging

Authors: Felix Wagner, Pramit Saha, Harry Anthony, Alison Noble, Konstantinos Kamnitsas

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code available at: https://github.com/FelixWag/DIsoN
Researcher Affiliation	Academia	Felix Wagner¹, Pramit Saha¹, Harry Anthony¹, J. Alison Noble¹, Konstantinos Kamnitsas¹; ¹Department of Engineering Science, University of Oxford. EMAIL
Pseudocode	Yes	Algorithm 1 Source Node: DIsoN / CC-DIsoN 1: function SOURCENODE(Ds, θpre) 2: if CC-DIsoN: 3: receive ŷ from target 4: θf ← θpre; θh ← rand; θ(0) ← (θf, θh) ▷ initialize global model 5: send θ(0) to Target; θS ← θ(0) 6: for r = 1 to R do ▷ communication rounds 7: for e = 1 to E do ▷ local updates on Source 8: if CC-DIsoN: 9: Bs ∼ {(xs, ys) ∈ Ds | ys = ŷ} ▷ filter Bs on predicted xt class 10: else 11: sample Bs ∼ Ds 12: θS ← θS − η ∇θS (1/|Bs|) Σ_{xs ∈ Bs} L(θS; xs, 0) 13: receive θT from Target 14: aggregation: θ(r) ← α θS + (1 − α) θT ▷ Eq. 2 15: SourceConverged ← converged(θ(r), Ds) ▷ Test criteria 2 (Def. 3.1) 16: send θ(r) and SourceConverged to Target 17: θS ← θ(r) ▷ update Source model for next comm. round Algorithm 2 Target Node: DIsoN / CC-DIsoN 1: function TARGETNODE(xt, Mpre, R) 2: if CC-DIsoN: 3: ŷ ← arg max_c [Mpre(xt)]_c 4: send ŷ to source 5: receive θ(0) from Source; θT ← θ(0) ▷ init. Target model with global model 6: for r = 1 to R do ▷ communication rounds 7: for e = 1 to E do ▷ local updates on Target 8: θT ← θT − η ∇θT L(θT; xt, 1) 9: send θT to Source 10: receive updated θ(r) and SourceConverged from Source 11: if converged(θ(r), xt) and SourceConverged: ▷ Test crit. 1 & 2 (Def. 3.1) 12: break 13: θT ← θ(r) ▷ update Target model for next comm. round 14: return S_DIsoN(xt) = r
Open Source Code	Yes	Code available at: https://github.com/FelixWag/DIsoN
Open Datasets	Yes	We evaluate DIsoN on four publicly available medical imaging benchmark datasets covering dermatology, breast ultrasound, chest X-ray, and histopathology. All datasets consist of real, clinically acquired images and no synthetic data is used. The first three benchmarks use images with naturally occurring non-diagnostic artifacts as OOD samples (e.g., rulers, pacemakers, annotations), while histopathology focuses on semantic and covariate shifts across domains. Example images are shown in Fig. 3. Dermatology & Breast Ultrasound: We adopt the benchmark setup from [3], using images without artifacts as the training and ID test data, and images with artifacts (rulers and annotations) as OOD samples. For breast ultrasound (BreastMNIST [41]), the artefacts are embedded text annotations, and for dermatology (D7P [19]) the artefacts are black overlaid rulers. [...] Chest X-Ray: Following the benchmark from [2], we use frontal-view X-ray scans (from CheXpert [17]) containing no-support devices as the training and ID test data, and scans containing pacemakers as OOD samples. [...] Histopathology: We use the MIDOG benchmark from OpenMIBOOD [11].
Dataset Splits	Yes	For breast ultrasound, the primary model Mpre is trained for 3-class classification (normal/benign/malignant). The 228 annotated scans with artifacts are used as OOD test samples, while the remaining artifact-free scans are split 90/10 into training and ID test sets. For dermatology, the primary model Mpre is trained for binary classification (nevus/non-nevus). The annotated 251 images with rulers are used as OOD samples and the remaining 1403 are split 90/10. [...] We use the 23,345 annotated scans without any support devices as our training data, and randomly hold out 1000 ID samples for testing. The OOD test set includes 1000 randomly sampled scans with pacemakers. [...] We use the 251 test ID samples from domain 1a and randomly sample 500 OOD samples from each of the near- and far-OOD domains. [...] Table 4: Class-wise dataset splits used in our experiments, showing the total number of ID images per class, splits into pre-training and ID test sets, OOD detection task, number of OOD test samples, and image resolution. For Histopathology, we report near-OOD (domains 2–7) and far-OOD (CCAgT, FNAC 2019) tasks separately. [...] For Dermatology (1403 artifact-free ID images) and Breast Ultrasound (552 artifact-free ID images), we follow the benchmark setup from [3], using manually annotated artifact-free images for pretraining and ID testing, and ruler/text annotation artifacts as OOD. The Chest X-ray dataset uses 23,345 frontal-view scans without support devices as ID data, following the setup from [2], with scans with pacemakers as OOD artifacts. In all three datasets, the ID data is split 90/10 into pre-training for the main task of interest and ID test sets.
Hardware Specification	Yes	Experiments were run on an NVIDIA RTX A5000.
Software Dependencies	No	The paper mentions software like ResNet18 (architecture) and optimizers like Adam and SGD, but does not specify version numbers for any software libraries or programming languages used (e.g., Python, PyTorch, TensorFlow versions). This makes it difficult to replicate the exact software environment.
Experiment Setup	Yes	We use a ResNet18 [12] with Instance Normalization (as per Sec. 3.2) pre-trained on the dataset-specific task as initialization for DIsoN. DIsoN is trained with Adam (lr=0.001 for dermatology and ultrasound; 0.003 for X-ray). For histopathology SGD with momentum (lr=0.01, momentum=0.9) is used (since [11] suggests pretraining with SGD). Local iterations per communication round are chosen to approximately match one epoch on the training data. We use standard augmentations (e.g. random cropping, rotation, color-jitter) and the aggregation weight is fixed to α = 0.8, since it performs consistently well across all our experiments (see Sec. 4.2 for effect of α). More training details and hyperparameters are provided in the Appendix. [...] Table 5: Pre-training hyperparameters. These settings are used to train the main classification task before initializing DIsoN. LR: learning rate; BS: batch size. Table 6: Training hyperparameters used for DIsoN experiments. Overview of architecture, optimizer, learning rate (LR), and batch size (BS) used to train DIsoN for each dataset. [...] To limit runtime, we also use a maximum number of communication rounds for each dataset. If convergence is not reached within this limit, we assign the maximum round Rmax as the OOD score. We use Rmax = 300 for Dermatology and Ultrasound, and Rmax = 100 for the longer-running Chest X-ray and Histopathology datasets.
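
Illustrative sketch (not the authors' code): the source/target loop quoted in the Pseudocode row, with the weighted aggregation θ(r) = α θS + (1 − α) θT and the Rmax fallback from the Experiment Setup row, can be mimicked on toy data as below. Several parts are assumptions made only for illustration: a logistic-regression model stands in for the ResNet18, both nodes run in one process with no real communication, and the converged(...) checks are hypothetical stand-ins because the test criteria of Def. 3.1 are not reproduced on this page. All function names are invented here. The official implementation is at https://github.com/FelixWag/DIsoN.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta, x):
    # theta = [w_1, ..., w_d, bias]; returns P(label = 1 | x).
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)) + theta[-1])


def local_step(theta, batch, label, lr):
    # One gradient step on the mean binary cross-entropy over the batch.
    g = [0.0] * len(theta)
    for x in batch:
        err = predict(theta, x) - label
        for i, xi in enumerate(x):
            g[i] += err * xi / len(batch)
        g[-1] += err / len(batch)
    return [t - lr * gi for t, gi in zip(theta, g)]


def aggregate(theta_s, theta_t, alpha):
    # Weighted average of source and target parameters (Eq. 2 in the pseudocode).
    return [alpha * s + (1 - alpha) * t for s, t in zip(theta_s, theta_t)]


def isolation_rounds(source, x_t, alpha=0.8, lr=0.1, local_steps=5,
                     batch_size=16, max_rounds=300, seed=0):
    """Return the number of rounds needed to isolate x_t from the source data."""
    rng = random.Random(seed)
    theta = [0.0] * (len(x_t) + 1)  # ASSUMPTION: zero init, not a pre-trained model
    for r in range(1, max_rounds + 1):
        theta_s, theta_t = list(theta), list(theta)
        for _ in range(local_steps):
            batch = rng.sample(source, min(batch_size, len(source)))
            theta_s = local_step(theta_s, batch, 0.0, lr)  # source samples: label 0
            theta_t = local_step(theta_t, [x_t], 1.0, lr)  # target sample: label 1
        theta = aggregate(theta_s, theta_t, alpha)
        # ASSUMPTION: hypothetical stand-ins for test criteria 1 and 2 (Def. 3.1).
        target_ok = predict(theta, x_t) > 0.5
        source_ok = sum(predict(theta, x) < 0.5 for x in source) / len(source) >= 0.95
        if target_ok and source_ok:
            return r
    return max_rounds  # not converged: fall back to the round limit, as with Rmax


if __name__ == "__main__":
    data_rng = random.Random(1)
    source_data = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
    print("far target: ", isolation_rounds(source_data, [8.0, 8.0], max_rounds=50))
    print("near target:", isolation_rounds(source_data, [0.0, 0.0], max_rounds=50))
```

In this toy setting, a target far from the source cloud should be isolated within a few rounds, while a target at the centre of the source data should exhaust the round budget and receive the maximum round as its score. How the round count maps to an OOD decision in the paper is not specified on this page, so the sketch returns only the raw count.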