Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning
Authors: Jannik Deuschel, Caleb Ellington, Yingtao Luo, Ben Lengerich, Pascal Friederich, Eric P. Xing
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on predicting antibiotic prescription in intensive care units (+22% AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients (+7.7% AUROC vs. previous SOTA). |
| Researcher Affiliation | Collaboration | (1) Carnegie Mellon University, (2) Karlsruhe Institute of Technology, (3) Broad Institute of MIT and Harvard, (4) MIT, (5) MBZUAI, (6) Petuum, Inc. |
| Pseudocode | No | The paper describes the methods conceptually and mathematically but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for data processing, model training, and figure generation is available on GitHub (footnote 1: https://github.com/JADEUSC/contextualized_policy_recovery). |
| Open Datasets | Yes | We look at 4195 patients in the intensive care unit over up to 6 timesteps extracted from the Medical Information Mart for Intensive Care III (Johnson et al., 2016) dataset and predict antibiotic prescription based on 7 observations: temperature, hematocrit, potassium, white blood cell count (WBC), blood pressure, heart rate, and creatinine. |
| Dataset Splits | Yes | Each dataset is split up into a training set (70% of patients), validation set (15% of patients) for hyperparameter tuning, and test set (15% of patients) to report model performance. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using "the Adam optimizer (Kingma & Ba, 2014)" but does not specify version numbers for any software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | The initial learning rate chosen for CPR is 5e-4 and 1e-4 for the baseline RNNs. We select the dimensions of the hidden state for both CPR and the baseline RNNs from [16,32,64]. For CPR, λ is chosen from [0.0001,0.001,0.01,0.1]. The batch size is selected as 64 for all models. Table 4 shows the optimal hyperparameters chosen based on the validation set performance. |
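
The Dataset Splits and Experiment Setup rows describe a 70/15/15 split by patient and a small hyperparameter grid for CPR. A minimal sketch of what that could look like follows. It is an illustration only, not the authors' code: the function names `split_patients` and `cpr_grid`, the dictionary keys, the fixed seed, and the rounding rule are all assumptions, and the linked repository may implement these steps differently.

```python
import itertools
import random


def split_patients(patient_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Split unique patient IDs into train/validation/test lists.

    Splitting by patient (not by timestep) keeps all timesteps of one
    patient in the same subset. The seed and rounding are assumptions.
    """
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder, roughly 15% of patients
    return train, val, test


def cpr_grid():
    """Enumerate the CPR search space quoted in the Experiment Setup row.

    Hidden sizes and lambda values come from the quoted text; the learning
    rate (5e-4) and batch size (64) are the fixed values reported for CPR.
    """
    hidden_dims = [16, 32, 64]
    lambdas = [0.0001, 0.001, 0.01, 0.1]
    return [
        {"hidden_dim": h, "lambda": lam, "lr": 5e-4, "batch_size": 64}
        for h, lam in itertools.product(hidden_dims, lambdas)
    ]


if __name__ == "__main__":
    train, val, test = split_patients(range(100))
    print(len(train), len(val), len(test))
    print(len(cpr_grid()), "CPR configurations")
```

Each configuration would then be trained on the training patients and compared on the validation patients, with the test patients held back for the final reported performance.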