Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Authors: Aaron Sonabend, Junwei Lu, Leo Anthony Celi, Tianxi Cai, Peter Szolovits

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform several analyses to assess ESRL policy learning, sensitivity to the risk aversion parameter α, value function estimation, and finally illustrate how we can interpret the posteriors within the context of the application. Figure 1(a) shows mean reward for T = 200 episodes while varying ϵ.
Researcher Affiliation | Academia | Aaron Sonabend-W (Harvard University, asonabend@g.harvard.edu); Junwei Lu (Harvard University, junweilu@hsph.harvard.edu); Leo A. Celi (MIT, lceli@mit.edu); Tianxi Cai (Harvard University, tcai@hsph.harvard.edu); Peter Szolovits (MIT, psz@mit.edu)
Pseudocode | Yes | Algorithm 1: Expert-Supervised RL
Open Source Code | Yes | The code for implementing ESRL with detailed comments is publicly available at https://github.com/asonabend/ESRL.
Open Datasets | Yes | We use the Riverswim environment [28] and a Sepsis data set built from MIMIC-III data [29]. [29] Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-W. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
Dataset Splits | Yes | The data set used has 12,991 episodes of 10 time steps (measurements at 4-hour intervals). We used 80% of episodes for training and 20% for testing. (An episode-level split sketch appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or cloud instance types) used for running the experiments were found.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver versions such as Python 3.8 or CPLEX 12.4) were found.
Experiment Setup | Yes | For Riverswim we use 2-128 unit layers and for Sepsis 128, 256 unit layers, respectively [31]. For ESRL, we use conjugate Dirichlet/multinomial and normal-gamma/normal pairs for the prior and likelihood of the transition and reward functions, respectively. We train policy π0 using PSRL [16] for 10,000 episodes; we then generate data set DT with π, varying both its size T and the noise ϵ. The offline-trained policies are then tested on the environment for 10,000 episodes.
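
The Experiment Setup row describes conjugate Dirichlet/multinomial and normal-gamma/normal models for the transition and reward functions, as used in PSRL-style posterior sampling. Below is a minimal sketch of such conjugate updates and posterior MDP sampling for a tabular environment like Riverswim; it is not the authors' released code, and the prior hyperparameters, state/action counts, and function names are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of conjugate posterior
# sampling: Dirichlet/multinomial for transitions, normal-gamma/normal for
# rewards. Hyperparameters and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2          # e.g. Riverswim has 6 states and 2 actions

# Dirichlet prior over next-state distributions, one vector per (s, a) pair
dirichlet_alpha = np.ones((n_states, n_actions, n_states))

# Normal-gamma prior over reward mean/precision, shared hyperparameters
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
reward_stats = {(s, a): {"n": 0, "mean": 0.0, "ss": 0.0}
                for s in range(n_states) for a in range(n_actions)}

def update(s, a, r, s_next):
    """Conjugate posterior update after observing one logged transition."""
    dirichlet_alpha[s, a, s_next] += 1.0
    st = reward_stats[(s, a)]
    st["n"] += 1
    delta = r - st["mean"]
    st["mean"] += delta / st["n"]
    st["ss"] += delta * (r - st["mean"])     # Welford running sum of squares

def sample_mdp():
    """Draw one plausible MDP (P, R) from the current posterior."""
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(dirichlet_alpha[s, a])
            st = reward_stats[(s, a)]
            n, xbar, ss = st["n"], st["mean"], st["ss"]
            # Normal-gamma posterior parameters
            lam_n = lam0 + n
            mu_n = (lam0 * mu0 + n * xbar) / lam_n
            a_n = a0 + n / 2.0
            b_n = b0 + 0.5 * ss + lam0 * n * (xbar - mu0) ** 2 / (2.0 * lam_n)
            tau = rng.gamma(a_n, 1.0 / b_n)                  # sampled precision
            R[s, a] = rng.normal(mu_n, 1.0 / np.sqrt(lam_n * tau))
    return P, R
```

An agent can call `update` on each logged transition and then plan in an MDP drawn by `sample_mdp`; this is the kind of posterior sampling the PSRL training step quoted above relies on.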
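
The Dataset Splits row reports an 80/20 train/test split over 12,991 episodes of 10 time steps. Below is a minimal sketch, not taken from the paper's code, of an episode-level split that keeps all time steps of an episode on the same side of the split; the file name and `episode_id` column are assumptions.

```python
# Minimal sketch of an episode-level 80/20 split (hypothetical file/column names).
import numpy as np
import pandas as pd

df = pd.read_csv("sepsis_trajectories.csv")      # one row per 4-hour time step
rng = np.random.default_rng(0)

episodes = df["episode_id"].unique()             # ~12,991 episodes of 10 steps each
rng.shuffle(episodes)

n_train = int(0.8 * len(episodes))               # 80% of episodes for training
train_ids = set(episodes[:n_train])

train_df = df[df["episode_id"].isin(train_ids)]  # all steps of a training episode
test_df = df[~df["episode_id"].isin(train_ids)]  # remaining 20% of episodes
```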