Towards Safe Policy Improvement for Non-Stationary MDPs
Authors: Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, Philip S. Thomas
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide an empirical analysis on two domains inspired by safety-critical real-world problems that exhibit non-stationarity. In the following, we first briefly discuss these domains, and in Figure 4 we present a summary of results for eight settings (four for each domain). A more detailed description of the domains and the experimental setup is available in Appendix F. |
| Researcher Affiliation | Collaboration | Yash Chandak University of Massachusetts ychandak@cs.umass.edu Scott M. Jordan University of Massachusetts sjordan@cs.umass.edu Georgios Theocharous Adobe Research theochar@adobe.com Martha White University of Alberta & Amii whitem@alberta.ca Philip S. Thomas University of Massachusetts pthomas@cs.umass.edu |
| Pseudocode | Yes | More elaborate details and complete algorithms are deferred to Appendix E. |
| Open Source Code | No | The paper does not explicitly state that the source code for their method is open-source or provide a link to a repository. |
| Open Datasets | Yes | Non-Stationary Diabetes Treatment: This environment is based on an open-source implementation [68] of the FDA approved Type-1 Diabetes Mellitus simulator (T1DMS) [44, 37]. |
| Dataset Splits | Yes | To address this problem, we partition D into two mutually exclusive sets, namely Dtrain and Dtest, such that only Dtrain is used to search for a candidate policy πc and only Dtest is used during the safety test. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as CPU models, GPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Simglucose v0.2.1 (2018)' but does not list other software dependencies or their specific version numbers that would be needed to reproduce the experiment (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For both the domains, (a) we set πsafe to a near-optimal policy for the starting MDP M1, representing how a doctor would have set the treatment initially, or how an expert would have set the recommendations, (b) we set the safety level (1 − α) to 95%, (c) we modulate the speed of non-stationarity, such that higher speeds represent a faster rate of non-stationarity and a speed of zero represents a stationary domain... For all experiments, we ran a total of 2000 episodes for each setting. The discount factor γ was set to 0.999. Candidate policies are searched using Monte Carlo policy gradient search with linear function approximation (features are polynomials of order 5) and the number of iterations for gradient search is fixed to 10. |
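The Dataset Splits and Experiment Setup rows together describe a standard safe-policy-improvement pattern: partition the collected episodes D into Dtrain (used to search for a candidate policy πc) and Dtest (used only for the safety test), then deploy πc only if a high-confidence lower bound on its performance beats the safe policy. The sketch below is illustrative, not the paper's implementation: the data, the `safety_test` helper, and the normal-approximation confidence bound are all assumptions (the paper uses its own estimators and tighter bounds).

```python
import numpy as np

def safety_test(returns_candidate, returns_safe, alpha=0.05):
    """Deploy the candidate only if a one-sided (1 - alpha) lower
    confidence bound on its mean return (estimated on D_test) exceeds
    the safe policy's estimated performance.

    A normal-approximation bound is used here purely for illustration;
    the z value below is the one-sided 95% standard-normal quantile.
    """
    n = len(returns_candidate)
    mean = returns_candidate.mean()
    std_err = returns_candidate.std(ddof=1) / np.sqrt(n)
    z = 1.645  # ~one-sided 95% quantile, matching alpha = 0.05
    lower_bound = mean - z * std_err
    baseline = returns_safe.mean()
    return lower_bound > baseline, lower_bound, baseline

# Hypothetical per-episode returns standing in for D; in the paper,
# D is split into mutually exclusive D_train and D_test.
episodes = np.arange(200, dtype=float)
d_train, d_test = episodes[:100], episodes[100:]  # candidate search vs. safety test
```

A candidate with high estimated return but large variance on Dtest fails this test, which is the mechanism that keeps the (1 − α) = 95% safety guarantee: with probability at least 1 − α, a policy worse than πsafe is not deployed.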