Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies

Authors: Shengpu Tang, Aditya Modi, Michael Sjoding, Jenna Wiens

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze the theoretical properties of the proposed algorithm, providing optimality guarantees and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions.
Researcher Affiliation | Academia | 1. Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, US; 2. Department of Internal Medicine, Michigan Medicine, University of Michigan, Ann Arbor, US; 3. Institute for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, US.
Pseudocode | Yes | Algorithm 1: TD learning for near-greedy ζ-optimal SVP
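
The paper's Algorithm 1 is not reproduced on this page. As a rough illustration of the idea it names, the following is a minimal tabular sketch, assuming a standard Q-learning backup, an epsilon-greedy behavior policy, the classic OpenAI Gym step/reset API, and a near-greedy set that keeps every action whose Q-value is within a relative tolerance ζ (zeta) of the best action at that state; none of these choices or hyperparameters are taken from the paper itself.

```python
import numpy as np

def td_learn_svp(env, n_states, n_actions, zeta=0.1, gamma=0.99,
                 alpha=0.1, n_episodes=10_000, epsilon=0.1):
    """Tabular Q-learning sketch; afterwards, extract a set-valued policy (SVP)
    keeping every action within a relative tolerance zeta of the best action."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False          # classic Gym API assumed (reset returns the state)
        while not done:
            # assumed epsilon-greedy behavior policy for exploration
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)  # classic Gym API assumed (4-tuple step)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # near-greedy set: actions whose Q-value is within a relative margin zeta of the max;
    # the relative threshold assumes nonnegative values (use an absolute margin otherwise)
    svp = {s: set(np.flatnonzero(Q[s] >= (1 - zeta) * np.max(Q[s])).tolist())
           for s in range(n_states)}
    return Q, svp
```

The set-extraction step is the only part specific to SVPs here: rather than committing to the single argmax action, it returns all actions whose estimated value clears a ζ-dependent threshold, which is the kind of near-equivalent action set the paper's title refers to.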
Open Source Code | Yes | The code to reproduce our experiments is available online at https://gitlab.eecs.umich.edu/MLD3/RL-Set-Valued-Policy
Open Datasets | Yes | Applying the specified inclusion and exclusion criteria (Komorowski et al., 2018) to the MIMIC-III database (Johnson et al., 2016), we identified a cohort of 20,940 patients with sepsis (Table 1).
Dataset Splits | Yes | The cohort was split into 70% training, 10% validation and 20% test.
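
For reference, a 70/10/20 split at the patient level (so that all of one patient's data lands in a single partition) could be sketched as below; `patient_ids`, the seed, and the function name are hypothetical and not taken from the released code.

```python
import numpy as np

def split_cohort(patient_ids, seed=0):
    """Shuffle patients and split 70% / 10% / 20% at the patient level,
    so every trajectory from a given patient falls in a single partition."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(patient_ids)
    n_train, n_val = int(0.7 * len(ids)), int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```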
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions software like 'Open AI Gym' and 'Q-learning' but does not specify version numbers for any key software components or libraries required for replication.
Experiment Setup | Yes | γ is set to 0.99 to place nearly as much importance on late deaths as early deaths. During training, each episode is generated by randomly sampling a patient trajectory from the training set (with replacement). Given the complexity of this environment, to improve convergence, we exponentially decay the step size α every 1,000 episodes. We train the RL agent for 1,000,000 episodes, after which TD errors stabilize and the estimated Q-values reach plateaus.
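
The quoted setup suggests a training loop of roughly the following shape. Only γ = 0.99, the 1,000-episode decay interval, and the 1,000,000-episode budget come from the quote; the initial step size, decay factor, trajectory format, and the `agent.td_update` hook are illustrative assumptions.

```python
import random

GAMMA = 0.99            # from the paper: weights late deaths almost as heavily as early ones
N_EPISODES = 1_000_000  # from the paper: TD errors stabilize by this point
ALPHA_0 = 0.1           # assumed initial step size (not reported)
DECAY = 0.999           # assumed decay factor applied every 1,000 episodes (not reported)

def train(agent, train_trajectories):
    """Sample patient trajectories with replacement and apply TD updates,
    decaying the step size exponentially every 1,000 episodes."""
    alpha = ALPHA_0
    for episode in range(1, N_EPISODES + 1):
        traj = random.choice(train_trajectories)  # sampling with replacement
        for (s, a, r, s_next, done) in traj:
            # agent.td_update is a hypothetical hook for the TD backup
            agent.td_update(s, a, r, s_next, done, alpha=alpha, gamma=GAMMA)
        if episode % 1_000 == 0:
            alpha *= DECAY  # exponential step-size decay
    return agent
```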