Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Authors: Aaron Sonabend, Junwei Lu, Leo Anthony Celi, Tianxi Cai, Peter Szolovits

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform several analyses to assess ESRL policy learning, sensitivity to the risk aversion parameter α, value function estimation, and finally illustrate how we can interpret the posteriors within the context of the application. Figure 1(a) shows mean reward for T = 200 episodes while varying ϵ.
Researcher Affiliation | Academia | Aaron Sonabend-W (Harvard University, asonabend@g.harvard.edu); Junwei Lu (Harvard University, junweilu@hsph.harvard.edu); Leo A. Celi (MIT, lceli@mit.edu); Tianxi Cai (Harvard University, tcai@hsph.harvard.edu); Peter Szolovits (MIT, psz@mit.edu)
Pseudocode | Yes | Algorithm 1: Expert-Supervised RL
Open Source Code | Yes | The code for implementing ESRL with detailed comments is publicly available at https://github.com/asonabend/ESRL.
Open Datasets | Yes | We use the Riverswim environment [28] and a Sepsis data set built from MIMIC-III data [29]. [29] Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-W. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
Dataset Splits | Yes | The data set used has 12,991 episodes of 10 time steps (measurements at 4-hour intervals). We used 80% of episodes for training and 20% for testing. (An episode-level split sketch appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or cloud instance types) used for running the experiments were found.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver versions such as Python 3.8 or CPLEX 12.4) were found.
Experiment Setup | Yes | For Riverswim we use 2-128 unit layers and for Sepsis 128, 256 unit layers, respectively [31]. For ESRL, we use conjugate Dirichlet/multinomial and normal-gamma/normal pairs for the prior and likelihood of the transition and reward functions, respectively. We train policy π0 using PSRL [16] for 10,000 episodes; we then generate data set DT with π, varying both its size T and the noise ϵ. The offline-trained policies are then tested on the environment for 10,000 episodes.
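
The Experiment Setup row describes conjugate Dirichlet/multinomial and normal-gamma/normal models for the transition and reward functions, as used in PSRL-style posterior sampling. Below is a minimal sketch of such conjugate updates and posterior MDP sampling for a tabular environment like Riverswim; it is not the authors' released code, and the prior hyperparameters, state/action counts, and function names are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of conjugate posterior
# sampling: Dirichlet/multinomial for transitions, normal-gamma/normal for
# rewards. Hyperparameters and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2          # e.g. Riverswim has 6 states and 2 actions

# Dirichlet prior over next-state distributions, one vector per (s, a) pair
dirichlet_alpha = np.ones((n_states, n_actions, n_states))

# Normal-gamma prior over reward mean/precision, shared hyperparameters
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
reward_stats = {(s, a): {"n": 0, "mean": 0.0, "ss": 0.0}
                for s in range(n_states) for a in range(n_actions)}

def update(s, a, r, s_next):
    """Conjugate posterior update after observing one logged transition."""
    dirichlet_alpha[s, a, s_next] += 1.0
    st = reward_stats[(s, a)]
    st["n"] += 1
    delta = r - st["mean"]
    st["mean"] += delta / st["n"]
    st["ss"] += delta * (r - st["mean"])     # Welford running sum of squares

def sample_mdp():
    """Draw one plausible MDP (P, R) from the current posterior."""
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(dirichlet_alpha[s, a])
            st = reward_stats[(s, a)]
            n, xbar, ss = st["n"], st["mean"], st["ss"]
            # Normal-gamma posterior parameters
            lam_n = lam0 + n
            mu_n = (lam0 * mu0 + n * xbar) / lam_n
            a_n = a0 + n / 2.0
            b_n = b0 + 0.5 * ss + lam0 * n * (xbar - mu0) ** 2 / (2.0 * lam_n)
            tau = rng.gamma(a_n, 1.0 / b_n)                  # sampled precision
            R[s, a] = rng.normal(mu_n, 1.0 / np.sqrt(lam_n * tau))
    return P, R
```

An agent can call `update` on each logged transition and then plan in an MDP drawn by `sample_mdp`; this is the kind of posterior sampling the PSRL training step quoted above relies on.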
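
The Dataset Splits row reports an 80/20 train/test split over 12,991 episodes of 10 time steps. Below is a minimal sketch, not taken from the paper's code, of an episode-level split that keeps all time steps of an episode on the same side of the split; the file name and `episode_id` column are assumptions.

```python
# Minimal sketch of an episode-level 80/20 split (hypothetical file/column names).
import numpy as np
import pandas as pd

df = pd.read_csv("sepsis_trajectories.csv")      # one row per 4-hour time step
rng = np.random.default_rng(0)

episodes = df["episode_id"].unique()             # ~12,991 episodes of 10 steps each
rng.shuffle(episodes)

n_train = int(0.8 * len(episodes))               # 80% of episodes for training
train_ids = set(episodes[:n_train])

train_df = df[df["episode_id"].isin(train_ids)]  # all steps of a training episode
test_df = df[~df["episode_id"].isin(train_ids)]  # remaining 20% of episodes
```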