Safe Reinforcement Learning in Constrained Markov Decision Processes
Authors: Akifumi Wachi, Yanan Sui
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we demonstrate the effectiveness of SNO-MDP through two experiments: one uses synthetic data in a new, openly-available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration by using real observation data. |
| Researcher Affiliation | Collaboration | IBM Research AI, Tokyo, Japan; Tsinghua University, Beijing, China. Correspondence to: Akifumi Wachi <akifumi.wachi@ibm.com>, Yanan Sui <ysui@tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 SNO-MDP with ES2 (a rough sketch of the safe-set expansion idea behind this algorithm appears after this table). |
| Open Source Code | Yes | We build an openly-available test-bed called GP-SAFETY-GYM for synthetic experiments. The safety and efficiency of SNO-MDP are then evaluated with two experiments: one in the GP-SAFETY-GYM synthetic environment, and the other using real Mars terrain data. https://github.com/akifumi-wachi-4/safe_near_optimal_mdp |
| Open Datasets | No | The paper mentions using "synthetic data in a new, openly-available environment named GP-SAFETY-GYM" and "real observation data" for Mars surface exploration. It also states "We created a 40 × 30 rectangular grid-world by clipping a region around latitude 30.6° south and longitude 202.2° east, as shown in Figure 4." However, it does not provide specific access information (link, DOI, or citation with authors/year) for these datasets to confirm public availability. |
| Dataset Splits | No | The paper mentions using a 20x20 square grid for synthetic data and a 40x30 rectangular grid-world for Mars data, but does not provide specific train/validation/test splits, percentages, or absolute counts for the datasets used. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "OpenAI Safety Gym" as a basis for its GP-SAFETY-GYM environment and specifies "Gaussian processes (GPs, see Rasmussen (2004))", a "Matérn kernel with ν = 5/2", and an "RBF kernel". However, it does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | In this simulation, we allowed the agent to observe the reward and safety function values of the current state and neighboring states. The kernel for reward was a radial basis function (RBF) with length-scales of 2 and prior variance of 1. The kernel for safety was also an RBF with length-scales of 2 and prior variance of 1. Finally, we set the discount factor to γ = 0.99, and the confidence interval parameters to αt = 3 and βt = 2 for all t ≥ 1. ... We set the confidence levels as αt = 3 and βt = 2, t ≥ 0, and the discount factor as γ = 0.9. (A minimal code sketch of this kernel and hyperparameter setup follows this table.) |
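
The kernel and hyperparameter details quoted in the Experiment Setup and Software Dependencies rows can be collected into a short, self-contained sketch. The paper does not name a specific GP library, so scikit-learn is assumed here purely for illustration; the variable names (`reward_gp`, `safety_gp`, `matern_kernel`) are hypothetical and do not come from the paper.

```python
# Minimal sketch of the GP priors described in the paper, assuming scikit-learn.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel

# RBF kernels with length-scale 2; ConstantKernel(1.0) supplies the unit prior
# variance quoted for both the reward and the safety function.
reward_kernel = ConstantKernel(1.0) * RBF(length_scale=2.0)
safety_kernel = ConstantKernel(1.0) * RBF(length_scale=2.0)

# The Software Dependencies row also mentions a Matérn kernel with nu = 5/2.
matern_kernel = ConstantKernel(1.0) * Matern(length_scale=2.0, nu=2.5)

reward_gp = GaussianProcessRegressor(kernel=reward_kernel)
safety_gp = GaussianProcessRegressor(kernel=safety_kernel)

# Confidence-interval and discounting parameters quoted in the paper.
alpha_t = 3.0   # width multiplier for the reward confidence interval
beta_t = 2.0    # width multiplier for the safety confidence interval
gamma = 0.99    # discount factor (0.9 is used in the Mars experiment)
```

Algorithm 1 (SNO-MDP with ES2) is only referenced by name in the Pseudocode row. As a rough illustration of the pessimistic safe-set expansion step that GP-based safe exploration methods of this kind rely on, the sketch below certifies a state as safe when the lower confidence bound of the safety GP clears a safety threshold. All names (`expand_safe_set`, `h_threshold`, `candidate_states`) are hypothetical, and this is not the authors' implementation of SNO-MDP or ES2.

```python
import numpy as np

def expand_safe_set(candidate_states, safety_gp, beta_t, h_threshold, safe_set):
    """Add states whose pessimistic (lower-confidence-bound) safety estimate
    clears the safety threshold. safety_gp is assumed to expose a
    scikit-learn-style predict(X, return_std=True) interface."""
    for s in candidate_states:
        mean, std = safety_gp.predict(np.atleast_2d(s), return_std=True)
        lower_bound = mean[0] - beta_t * std[0]  # pessimistic safety value
        if lower_bound >= h_threshold:
            safe_set.add(tuple(np.ravel(s)))
    return safe_set
```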
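In the paper, this kind of pessimistic safe set is grown iteratively as new observations shrink the GP confidence intervals, and the reward is then optimized only over states certified in this way; the sketch above shows just the per-state safety check under those assumptions.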