Safe Reinforcement Learning in Constrained Markov Decision Processes

Authors: Akifumi Wachi, Yanan Sui

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we demonstrate the effectiveness of SNO-MDP through two experiments: one uses a synthetic data in a new, openly-available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration by using real observation data.
Researcher Affiliation | Collaboration | 1IBM Research AI, Tokyo, Japan; 2Tsinghua University, Beijing, China. Correspondence to: Akifumi Wachi <akifumi.wachi@ibm.com>, Yanan Sui <ysui@tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1: SNO-MDP with ES2
Open Source Code | Yes | We build an openly-available test-bed called GP-SAFETY-GYM for synthetic experiments. The safety and efficiency of SNO-MDP are then evaluated with two experiments: one in the GP-SAFETY-GYM synthetic environment, and the other using real Mars terrain data. Footnote 1: https://github.com/akifumi-wachi-4/safe_near_optimal_mdp
Open Datasets | No | The paper mentions using a "synthetic data in a new, openly-available environment named GP-SAFETY-GYM" and "real observation data" for Mars surface exploration. It also states "We created a 40 × 30 rectangular grid-world by clipping a region around latitude 30.6° south and longitude 202.2° east, as shown in Figure 4." However, it does not provide specific access information (link, DOI, citation with authors/year) for these datasets to confirm public availability.
Dataset Splits | No | The paper mentions using a 20 × 20 square grid for synthetic data and a 40 × 30 rectangular grid-world for Mars data, but does not provide specific train/validation/test splits, percentages, or absolute counts for the datasets used.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions using "OpenAI Safety Gym" as a basis for their GP-SAFETY-GYM environment and specifies "Gaussian processes (GPs, see Rasmussen (2004))", a "Matérn kernel with ν = 5/2", and an "RBF kernel". However, it does not provide specific version numbers for any software components or libraries.
Experiment Setup | Yes | In this simulation, we allowed the agent to observe the reward and safety function values of the current state and neighboring states. The kernel for reward was a radial basis function (RBF) with a length-scale of 2 and prior variance of 1. The kernel for safety was also an RBF with a length-scale of 2 and prior variance of 1. Finally, we set the discount factor to γ = 0.99, and the confidence interval parameters to αt = 3 and βt = 2 for all t ≥ 1. ... We set the confidence levels as αt = 3 and βt = 2, t ≥ 0, and the discount factor as γ = 0.9.
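
The experiment-setup excerpt is concrete enough to sketch in code. The snippet below is a minimal, hypothetical illustration, not the authors' released implementation: it assumes the GPy library (the excerpt does not name a specific GP package), interprets αt and βt as width multipliers of SafeOpt-style confidence intervals around the reward and safety GP posteriors, and invents the toy grid, observations, and safety threshold h purely for illustration.

    # Hypothetical sketch (not the authors' code): reward and safety GPs with the
    # RBF hyperparameters quoted above, built with GPy.
    import numpy as np
    import GPy

    # Toy 2-D state grid standing in for the 20 x 20 GP-SAFETY-GYM world.
    states = np.array([[x, y] for x in range(20) for y in range(20)], dtype=float)

    # A single noisy observation at an assumed-safe start state (values are invented).
    X_obs = states[:1]
    y_reward_obs = np.array([[0.3]])
    y_safety_obs = np.array([[1.2]])

    # RBF kernels with length-scale 2 and prior variance 1, as stated in the excerpt.
    reward_kernel = GPy.kern.RBF(input_dim=2, variance=1.0, lengthscale=2.0)
    safety_kernel = GPy.kern.RBF(input_dim=2, variance=1.0, lengthscale=2.0)
    reward_gp = GPy.models.GPRegression(X_obs, y_reward_obs, reward_kernel)
    safety_gp = GPy.models.GPRegression(X_obs, y_safety_obs, safety_kernel)

    # Parameters quoted in the setup; gamma is the discount factor of the MDP.
    alpha_t = 3.0  # assumed: width multiplier of the reward confidence interval
    beta_t = 2.0   # assumed: width multiplier of the safety confidence interval
    gamma = 0.99

    # Posterior mean/variance at every state, turned into mu +/- c * sigma bounds.
    mu_r, var_r = reward_gp.predict(states)
    mu_s, var_s = safety_gp.predict(states)
    reward_ucb = mu_r + alpha_t * np.sqrt(var_r)  # optimistic reward estimate
    safety_lcb = mu_s - beta_t * np.sqrt(var_s)   # pessimistic safety estimate

    # A state is tentatively safe when its pessimistic safety value clears a
    # threshold h; the threshold used here is illustrative only.
    h = 0.0
    safe_mask = safety_lcb.ravel() >= h
    print(f"{safe_mask.sum()} of {len(states)} states currently certified safe")

Under these assumptions, a larger βt widens the safety confidence interval and makes the pessimistic estimate more conservative, which is the role the quoted confidence parameters play when deciding which states count as safe.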