Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Timely Clinical Diagnosis through Active Test Selection

Authors: Silas Ruhrberg Estévez, Nicolás Astorga, Mihaela van der Schaar

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
Researcher Affiliation | Academia | Silas Ruhrberg Estévez, University of Cambridge, Cambridge, UK, EMAIL; Nicolás Astorga, University of Cambridge, Cambridge, UK, EMAIL; Mihaela van der Schaar, University of Cambridge, Cambridge, UK, EMAIL
Pseudocode | Yes | Pseudocode for the Bayesian selection using the KL-divergence is given in Algorithm 1. ... Algorithm 1 KL-guided Diagnostic Test Selection
Open Source Code | Yes | The code and datasets to reproduce the main findings of this paper are available under https://github.com/Sr933/actmed.
Open Datasets | Yes | Chronic Kidney Disease... The dataset is available from the UCI Machine Learning Repository under a CC BY 4.0 license: https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease. Hepatitis... The dataset is publicly available from the UCI Repository under a CC BY 4.0 license: https://archive.ics.uci.edu/dataset/571/hcv+data. Diabetes... The dataset is available on Kaggle under a CC0 Public Domain license: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database. OSCE... We release the modified OSCE dataset alongside our code to facilitate replication and comparison.
Dataset Splits | No | The paper evaluates LLM-based models in a zero-shot setting or by applying the framework per patient, rather than training a model on the provided datasets and detailing dataset splits for that purpose. While it mentions 'random subset' for data selection and 'evaluation folds' in Table 11, the methodology for standard training/test/validation splits for model development or evaluation is not explicitly described.
Hardware Specification | No | The paper states that experiments used GPT-4o and GPT-4o-mini via 'Azure OpenAI Service' and open-source models (Biomistral-7B, LLaMA-70B) via 'Hugging Face using vLLM'. However, it does not provide specific hardware details such as GPU or CPU models, memory, or processor types.
Software Dependencies | Yes | All experiments were implemented using GPT-4o (Version 2024-11-20) and GPT-4o-mini (Version 2024-07-18) as provided on the Azure OpenAI Service. ... Open-source experiments were run using Biomistral-7B and Llama70B version 3.3 as provided on Hugging Face using vLLM.
Experiment Setup | Yes | To ensure robustness, each experiment was run across 5 different random seeds. ... For all experiments, we set the number of sampled test outcomes or risk probability distributions to 10. To ensure the sampling produced a more diverse set of responses, we used a temperature of 1 and specifically instructed the model in the prompts to simulate randomness. ... We evaluated this criterion using thresholds γ ∈ {0.3, 0.5, 0.7}.
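
The Pseudocode and Experiment Setup entries above name a KL-guided test selection procedure that works from 10 sampled risk probability distributions, but this page does not reproduce Algorithm 1 itself. The following is a minimal sketch of what such a selection step might look like, not the authors' implementation. It assumes a binary disease risk (a Bernoulli distribution), a Monte Carlo average of KL(posterior || prior) as the score, and already-available sampled posteriors for each candidate test. The function names, test names, and numbers are all hypothetical. The threshold criterion involving γ is not described here, so it is left out.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bern(p) || Bern(q)) in nats; inputs are clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def expected_kl_gain(prior_risk, sampled_posterior_risks):
    """Monte Carlo estimate of the expected KL from prior to posterior.

    Each entry of sampled_posterior_risks is the disease risk after one
    sampled outcome of the candidate test (assumed to be given).
    """
    total = sum(bernoulli_kl(post, prior_risk) for post in sampled_posterior_risks)
    return total / len(sampled_posterior_risks)


def select_test(prior_risk, candidates):
    """Return (best_test_name, gains) for the largest estimated KL gain.

    candidates maps a hypothetical test name to its sampled posterior risks.
    """
    gains = {
        name: expected_kl_gain(prior_risk, posts)
        for name, posts in candidates.items()
    }
    best = max(gains, key=gains.get)
    return best, gains


# Illustrative inputs only: 10 sampled posteriors per test, as in the setup row.
example_candidates = {
    "serum_creatinine": [0.9] * 5 + [0.1] * 5,   # outcomes shift the risk strongly
    "blood_pressure": [0.55] * 5 + [0.45] * 5,   # outcomes barely move the risk
}
best_test, kl_gains = select_test(0.5, example_candidates)
```

In this toy input the test whose sampled outcomes move the risk estimate furthest from the prior receives the larger score. In the paper's setting the sampled posteriors would presumably come from the LLM at temperature 1, which this sketch does not model.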