Towards Safe Policy Improvement for Non-Stationary MDPs
Authors: Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, Philip S. Thomas
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide an empirical analysis on two domains inspired by safety-critical real-world problems that exhibit non-stationarity. In the following, we first briefly discuss these domains, and in Figure 4 we present a summary of results for eight settings (four for each domain). A more detailed description of the domains and the experimental setup is available in Appendix F. |
| Researcher Affiliation | Collaboration | Yash Chandak University of Massachusetts ychandak@cs.umass.edu Scott M. Jordan University of Massachusetts sjordan@cs.umass.edu Georgios Theocharous Adobe Research theochar@adobe.com Martha White University of Alberta & Amii whitem@alberta.ca Philip S. Thomas University of Massachusetts pthomas@cs.umass.edu |
| Pseudocode | Yes | More elaborate details and complete algorithms are deferred to Appendix E. |
| Open Source Code | No | The paper does not explicitly state that the source code for their method is open-source or provide a link to a repository. |
| Open Datasets | Yes | Non-Stationary Diabetes Treatment: This environment is based on an open-source implementation [68] of the FDA approved Type-1 Diabetes Mellitus simulator (T1DMS) [44, 37]. |
| Dataset Splits | Yes | To address this problem, we partition D into two mutually exclusive sets, namely Dtrain and Dtest, such that only Dtrain is used to search for a candidate policy πc and only Dtest is used during the safety test. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as CPU models, GPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Simglucose v0.2.1 (2018)' but does not list other software dependencies or their specific version numbers that would be needed to reproduce the experiment (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For both the domains, (a) we set πsafe to a near-optimal policy for the starting MDP M1, representing how a doctor would have set the treatment initially, or how an expert would have set the recommendations, (b) we set the safety level (1 − α) to 95%, (c) we modulate the speed of non-stationarity, such that higher speeds represent a faster rate of non-stationarity and a speed of zero represents a stationary domain... For all experiments, we ran a total of 2000 episodes for each setting. The discount factor γ was set to 0.999. Candidate policies are searched using Monte Carlo policy gradient search with linear function approximation (features are polynomials of order 5) and the number of iterations for gradient search is fixed to 10. |
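The Dataset Splits and Experiment Setup rows together describe a standard safe-policy-improvement pattern: partition the collected episodes D into Dtrain (used to search for a candidate policy πc) and Dtest (used only for the safety test), then deploy πc only if a high-confidence lower bound on its performance beats the safe policy. The sketch below is illustrative, not the paper's implementation: the data, the `safety_test` helper, and the normal-approximation confidence bound are all assumptions (the paper uses its own estimators and tighter bounds).

```python
import numpy as np

def safety_test(returns_candidate, returns_safe, alpha=0.05):
    """Deploy the candidate only if a one-sided (1 - alpha) lower
    confidence bound on its mean return (estimated on D_test) exceeds
    the safe policy's estimated performance.

    A normal-approximation bound is used here purely for illustration;
    the z value below is the one-sided 95% standard-normal quantile.
    """
    n = len(returns_candidate)
    mean = returns_candidate.mean()
    std_err = returns_candidate.std(ddof=1) / np.sqrt(n)
    z = 1.645  # ~one-sided 95% quantile, matching alpha = 0.05
    lower_bound = mean - z * std_err
    baseline = returns_safe.mean()
    return lower_bound > baseline, lower_bound, baseline

# Hypothetical per-episode returns standing in for D; in the paper,
# D is split into mutually exclusive D_train and D_test.
episodes = np.arange(200, dtype=float)
d_train, d_test = episodes[:100], episodes[100:]  # candidate search vs. safety test
```

A candidate with high estimated return but large variance on Dtest fails this test, which is the mechanism that keeps the (1 − α) = 95% safety guarantee: with probability at least 1 − α, a policy worse than πsafe is not deployed.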