Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Timely Clinical Diagnosis through Active Test Selection

Authors: Silas Ruhrberg Estévez, Nicolás Astorga, Mihaela van der Schaar

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
Researcher Affiliation	Academia	Silas Ruhrberg Estévez University of Cambridge Cambridge, UK EMAIL Nicolás Astorga University of Cambridge Cambridge, UK EMAIL Mihaela van der Schaar University of Cambridge Cambridge, UK EMAIL
Pseudocode	Yes	Pseudocode for the Bayesian selection using the KL-divergence is given in Algorithm 1. ... Algorithm 1 KL-guided Diagnostic Test Selection
Open Source Code	Yes	The code and datasets to reproduce the main findings of this paper are available under https://github.com/Sr933/actmed.
Open Datasets	Yes	Chronic Kidney Disease... The dataset is available from the UCI Machine Learning Repository under a CC BY 4.0 license: https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease. Hepatitis... The dataset is publicly available from the UCI Repository under a CC BY 4.0 license: https://archive.ics.uci.edu/dataset/571/hcv+data. Diabetes... The dataset is available on Kaggle under a CC0 Public Domain license: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database. OSCE... We release the modified OSCE dataset alongside our code to facilitate replication and comparison.
Dataset Splits	No	The paper evaluates LLM-based models in a zero-shot setting or by applying the framework per patient, rather than training a model on the provided datasets and detailing dataset splits for that purpose. While it mentions 'random subset' for data selection and 'evaluation folds' in Table 11, the methodology for standard training/test/validation splits for model development or evaluation is not explicitly described.
Hardware Specification	No	The paper states that experiments used GPT-4o and GPT-4o-mini via 'Azure OpenAI Service' and open-source models (Biomistral-7B, LLaMA-70B) via 'Hugging Face using vLLM'. However, it does not provide specific hardware details such as GPU or CPU models, memory, or processor types.
Software Dependencies	Yes	All experiments were implemented using GPT-4o (Version 2024-11-20) and GPT-4o-mini (Version 2024-07-18) as provided on the Azure OpenAI Service. ... Open-source experiments were run using Biomistral-7B and Llama-70B version 3.3 as provided on Hugging Face using vLLM.
Experiment Setup	Yes	To ensure robustness, each experiment was run across 5 different random seeds. ... For all experiments, we set the number of sampled test outcomes or risk probability distributions to 10. To ensure the sampling produced a more diverse set of responses, we used a temperature of 1 and specifically instructed the model in the prompts to simulate randomness. ... We evaluated this criterion using thresholds γ ∈ {0.3, 0.5, 0.7}.
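
For readers unfamiliar with the idea behind the Pseudocode row, the following is a minimal, generic sketch of KL-guided test selection. It is not the paper's Algorithm 1: it assumes a binary diagnosis, a scalar prior risk, and a handful of sampled posterior risks per candidate test. In the paper these samples are produced by an LLM; here they are hard-coded, hypothetical values, and the function names and test names are illustrative only.

```python
import math


def bernoulli_kl(p, q, eps=1e-9):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats; inputs are clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def expected_kl(prior_risk, sampled_posterior_risks):
    """Average KL between each sampled posterior risk and the prior risk."""
    total = sum(bernoulli_kl(p, prior_risk) for p in sampled_posterior_risks)
    return total / len(sampled_posterior_risks)


def select_test(prior_risk, posterior_samples_by_test):
    """Return the candidate test with the largest expected KL, plus all scores."""
    scores = {
        test: expected_kl(prior_risk, samples)
        for test, samples in posterior_samples_by_test.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical inputs: a prior risk of 0.5 and sampled posterior risks per test.
prior = 0.5
candidates = {
    "serum_creatinine": [0.90, 0.10, 0.85, 0.20],  # outcomes shift the belief a lot
    "blood_pressure": [0.55, 0.45, 0.50, 0.60],    # outcomes barely move the belief
}
best_test, kl_scores = select_test(prior, candidates)
print(best_test, kl_scores)
```

Under these assumptions, the test whose simulated outcomes move the risk estimate furthest from the prior is chosen. How the paper turns the thresholds γ into a stopping criterion is not shown here, since the table does not describe it.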