Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies
Authors: Shengpu Tang, Aditya Modi, Michael Sjoding, Jenna Wiens
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the theoretical properties of the proposed algorithm, providing optimality guarantees, and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, US 2Department of Internal Medicine, Michigan Medicine, University of Michigan, Ann Arbor, US 3Institute for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, US. |
| Pseudocode | Yes | Algorithm 1: TD learning for near-greedy ζ-optimal SVP (an illustrative sketch of the set-valued-policy idea follows the table). |
| Open Source Code | Yes | The code to reproduce our experiments is available online at https://gitlab.eecs.umich.edu/MLD3/RL-Set-Valued-Policy |
| Open Datasets | Yes | Applying the specified inclusion and exclusion criteria (Komorowski et al., 2018) to the MIMIC-III database (Johnson et al., 2016), we identified a cohort of 20,940 patients with sepsis (Table 1). |
| Dataset Splits | Yes | The cohort was split into 70% training, 10% validation and 20% test. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like 'Open AI Gym' and 'Q-learning' but does not specify version numbers for any key software components or libraries required for replication. |
| Experiment Setup | Yes | γ is set to 0.99 to place nearly as much importance on late deaths as early deaths. During training, each episode is generated by randomly sampling a patient trajectory from the training set (with replacement). Given the complexity of this environment, to improve convergence, we exponentially decay the step size α every 1,000 episodes. We train the RL agent for 1,000,000 episodes, after which TD errors stabilize and the estimated Q-values reach plateaus. (A sketch of this training schedule also follows the table.) |
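
The pseudocode row quotes the caption of the paper's Algorithm 1 (TD learning for a near-greedy ζ-optimal SVP). The sketch below does not reproduce that algorithm; it only illustrates the set-valued-policy idea with standard tabular Q-learning followed by a post-hoc near-greedy set extraction. The environment interface (`reset`/`step` returning `(s_next, r, done)`), the function names `q_learning` and `near_greedy_set`, and the threshold form `Q(s,a) ≥ (1 − ζ)·max_a Q(s,a)` are assumptions for illustration, not the paper's exact backup or criterion.

```python
import numpy as np

# Illustrative sketch only: plain tabular Q-learning, then a post-hoc
# near-greedy set extraction. The paper's Algorithm 1 modifies the TD
# backup itself to account for the action set; this is a simplification.

def q_learning(env, n_states, n_actions, episodes=10_000,
               gamma=0.99, alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a minimal env interface: env.reset() -> state index,
    env.step(a) -> (next_state, reward, done). This is not the exact
    OpenAI Gym API.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (int(rng.integers(n_actions)) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])  # TD(0) update
            s = s_next
    return Q

def near_greedy_set(Q, s, zeta):
    """Return the actions whose value is within a (1 - zeta) factor of the
    best action's value in state s (assumes nonnegative Q-values)."""
    q_max = Q[s].max()
    return np.flatnonzero(Q[s] >= (1.0 - zeta) * q_max)
```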
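
The experiment-setup row reports γ = 0.99, 1,000,000 training episodes, and an exponential decay of the step size α applied every 1,000 episodes. A minimal sketch of such a schedule is shown below; the initial step size and decay factor are assumed values, as neither is reported in the quoted excerpt.

```python
# Illustrative training schedule matching the quoted setup.
GAMMA = 0.99          # discount factor reported in the paper
N_EPISODES = 1_000_000
DECAY_EVERY = 1_000   # decay interval reported in the paper
ALPHA_0 = 0.1         # assumed initial step size (not reported)
DECAY_RATE = 0.99     # assumed per-interval decay factor (not reported)

def step_size(episode):
    """Step size alpha, decayed exponentially every DECAY_EVERY episodes."""
    return ALPHA_0 * DECAY_RATE ** (episode // DECAY_EVERY)
```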