Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies
Authors: Shengpu Tang, Aditya Modi, Michael Sjoding, Jenna Wiens
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the theoretical properties of the proposed algorithm, providing optimality guarantees, and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, US 2Department of Internal Medicine, Michigan Medicine, University of Michigan, Ann Arbor, US 3Institute for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, US. |
| Pseudocode | Yes | Algorithm 1: TD learning for near-greedy ζ-optimal SVP (an illustrative sketch of the set-valued-policy idea follows the table). |
| Open Source Code | Yes | The code to reproduce our experiments is available online at https://gitlab.eecs.umich.edu/MLD3/RL-Set-Valued-Policy |
| Open Datasets | Yes | Applying the specified inclusion and exclusion criteria (Komorowski et al., 2018) to the MIMIC-III database (Johnson et al., 2016), we identified a cohort of 20,940 patients with sepsis (Table 1). |
| Dataset Splits | Yes | The cohort was split into 70% training, 10% validation and 20% test. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like 'Open AI Gym' and 'Q-learning' but does not specify version numbers for any key software components or libraries required for replication. |
| Experiment Setup | Yes | γ is set to 0.99 to place nearly as much importance on late deaths as early deaths. During training, each episode is generated by randomly sampling a patient trajectory from the training set (with replacement). Given the complexity of this environment, to improve convergence, we exponentially decay the step size α every 1,000 episodes. We train the RL agent for 1,000,000 episodes, after which TD errors stabilize and the estimated Q-values reach plateaus. (A sketch of this training schedule also follows the table.) |
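
The pseudocode row quotes the caption of the paper's Algorithm 1 (TD learning for a near-greedy ζ-optimal SVP). The sketch below does not reproduce that algorithm; it only illustrates the set-valued-policy idea with standard tabular Q-learning followed by a post-hoc near-greedy set extraction. The environment interface (`reset`/`step` returning `(s_next, r, done)`), the function names `q_learning` and `near_greedy_set`, and the threshold form `Q(s,a) ≥ (1 − ζ)·max_a Q(s,a)` are assumptions for illustration, not the paper's exact backup or criterion.

```python
import numpy as np

# Illustrative sketch only: plain tabular Q-learning, then a post-hoc
# near-greedy set extraction. The paper's Algorithm 1 modifies the TD
# backup itself to account for the action set; this is a simplification.

def q_learning(env, n_states, n_actions, episodes=10_000,
               gamma=0.99, alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a minimal env interface: env.reset() -> state index,
    env.step(a) -> (next_state, reward, done). This is not the exact
    OpenAI Gym API.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (int(rng.integers(n_actions)) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])  # TD(0) update
            s = s_next
    return Q

def near_greedy_set(Q, s, zeta):
    """Return the actions whose value is within a (1 - zeta) factor of the
    best action's value in state s (assumes nonnegative Q-values)."""
    q_max = Q[s].max()
    return np.flatnonzero(Q[s] >= (1.0 - zeta) * q_max)
```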
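
The experiment-setup row reports γ = 0.99, 1,000,000 training episodes, and an exponential decay of the step size α applied every 1,000 episodes. A minimal sketch of such a schedule is shown below; the initial step size and decay factor are assumed values, as neither is reported in the quoted excerpt.

```python
# Illustrative training schedule matching the quoted setup.
GAMMA = 0.99          # discount factor reported in the paper
N_EPISODES = 1_000_000
DECAY_EVERY = 1_000   # decay interval reported in the paper
ALPHA_0 = 0.1         # assumed initial step size (not reported)
DECAY_RATE = 0.99     # assumed per-interval decay factor (not reported)

def step_size(episode):
    """Step size alpha, decayed exponentially every DECAY_EVERY episodes."""
    return ALPHA_0 * DECAY_RATE ** (episode // DECAY_EVERY)
```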