Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Safe Reinforcement Learning in Constrained Markov Decision Processes
Authors: Akifumi Wachi, Yanan Sui
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we demonstrate the effectiveness of SNO-MDP through two experiments: one uses a synthetic data in a new, openly-available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration by using real observation data. |
| Researcher Affiliation | Collaboration | ¹IBM Research AI, Tokyo, Japan; ²Tsinghua University, Beijing, China. Correspondence to: Akifumi Wachi <EMAIL>, Yanan Sui <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SNO-MDP with ES2 |
| Open Source Code | Yes | We build an openly-available test-bed called GP-SAFETY-GYM for synthetic experiments.¹ The safety and efficiency of SNO-MDP are then evaluated with two experiments: one in the GP-SAFETY-GYM synthetic environment, and the other using real Mars terrain data. ¹https://github.com/akifumi-wachi-4/safe_near_optimal_mdp |
| Open Datasets | No | The paper mentions using a "synthetic data in a new, openly-available environment named GP-SAFETY-GYM" and "real observation data" for Mars surface exploration. It also states "We created a 40 x 30 rectangular grid-world by clipping a region around latitude 30°6′ south and longitude 202°2′ east, as shown in Figure 4." However, it does not provide specific access information (link, DOI, citation with authors/year) for these datasets to confirm public availability. |
| Dataset Splits | No | The paper mentions using a 20x20 square grid for synthetic data and a 40x30 rectangular grid-world for Mars data, but does not provide specific train/validation/test splits, percentages, or absolute counts for the datasets used. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Open AI Safety-Gym" as a basis for their GP-SAFETY-GYM environment and specifies "Gaussian processes (GPs, see Rasmussen (2004))" and "Matérn kernel with ν = 5/2", "RBF kernel". However, it does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | In this simulation, we allowed the agent to observe the reward and safety function values of the current state and neighboring states. The kernel for reward was a radial basis function (RBF) with the length-scales of 2 and prior variance of 1. The kernel for safety was also an RBF with the length-scales of 2 and prior variance of 1. Finally, we set the discount factor to γ = 0.99, and confidence intervals parameters to αt = 3 and βt = 2 for all t ≥ 1. ... We set the confidence levels as αt = 3 and βt = 2, t ≥ 0, and the discount factor as γ = 0.9. |
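
To make the reported experiment setup concrete, the following is a minimal sketch of a Gaussian-process posterior with an RBF kernel (length-scale 2, prior variance 1) and a confidence interval scaled by one of the reported confidence parameters. It is not the authors' implementation: the function names, the observation-noise value, and the use of the confidence parameter as a plain multiplier on the posterior standard deviation are assumptions made for illustration only.

```python
import numpy as np

LENGTH_SCALE = 2.0      # RBF length-scale reported in the setup
PRIOR_VARIANCE = 1.0    # RBF prior variance reported in the setup
CONFIDENCE_SCALE = 2.0  # example value, taken from the reported beta_t = 2
NOISE_VARIANCE = 1e-2   # assumption: observation noise is not given in this summary


def rbf_kernel(a, b, length_scale=LENGTH_SCALE, variance=PRIOR_VARIANCE):
    """RBF (squared-exponential) kernel between point sets of shape (n, d) and (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_obs, y_obs, x_query, noise_variance=NOISE_VARIANCE):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise_variance * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    # Variance reduction at each query point: diag(K_qo K_oo^{-1} K_oq).
    reduction = np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    var = PRIOR_VARIANCE - reduction
    return mean, np.sqrt(np.clip(var, 0.0, None))


def confidence_bounds(mean, std, scale=CONFIDENCE_SCALE):
    """Lower and upper bounds mean -/+ scale * std (hypothetical helper)."""
    return mean - scale * std, mean + scale * std


# One observation at grid cell (0, 0); query that cell and a distant one.
x_obs = np.array([[0.0, 0.0]])
y_obs = np.array([1.0])
x_query = np.array([[0.0, 0.0], [100.0, 100.0]])
mean, std = gp_posterior(x_obs, y_obs, x_query)
lower, upper = confidence_bounds(mean, std)
```

Near the observed cell the interval is tight around the observed value, while at the distant cell it falls back to the prior (mean 0, standard deviation 1). This is the kind of uncertainty a GP-based agent would have to shrink before treating a state as safe.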