A Lyapunov-based Approach to Safe Reinforcement Learning
Authors: Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance. We evaluate their learning performance on two variants: one in which the observation is a one-hot encoding of the agent's location, and the other in which the observation is the 2D image representation of the grid map. In each of these, we evaluate performance when d0 = 1 and d0 = 5. (See the observation-encoding sketch after this table.) |
| Researcher Affiliation | Industry | Yinlam Chow, DeepMind, yinlamchow@google.com; Ofir Nachum, Google Brain, ofirnachum@google.com; Edgar Duenez-Guzman, DeepMind, duenez@google.com; Mohammad Ghavamzadeh, Facebook AI Research, mgh@fb.com |
| Pseudocode | Yes | Algorithm 1: Safe Policy Iteration (SPI); Algorithm 2: Safe Value Iteration (SVI). (A hedged sketch of the SPI idea appears after this table.) |
| Open Source Code | No | No explicit statement or link providing access to the authors' source code was found. |
| Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the "stochastic 2D grid-world motion planning problem" dataset, which appears to be custom-generated. |
| Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | For demonstration purposes, we choose a 25×25 grid-world (see Figure 1) with a total of 625 states. We also have a density ratio ρ ∈ (0, 1) that sets the obstacle-to-terrain ratio. When ρ is close to 0, the problem is obstacle-free, and if ρ is close to 1, then the problem becomes more challenging. In the normal problem setting, we choose a density ρ = 0.3, an error probability δ = 0.05, a constraint threshold d0 = 5, and a maximum horizon of 200 steps. The initial state is located at (24, 24) and the goal is placed at (0, α), where α ∈ [0, 24] is a uniform random variable. To account for statistical significance, the results of each experiment are averaged over 20 trials. (See the grid-world sketch after this table.) |
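
The grid-world in the Experiment Setup row is custom-generated, so the following is a minimal sketch of how such an environment could be instantiated from the stated parameters (25×25 grid, ρ = 0.3, δ = 0.05, horizon 200, start (24, 24), goal (0, α) with α uniform over [0, 24]). The transition and constraint-cost details (clamped moves at the border, unit constraint cost for entering an obstacle cell) are assumptions, not taken from the paper.

```python
import numpy as np

def make_gridworld(n=25, rho=0.3, delta=0.05, horizon=200, seed=0):
    """Sketch of the stated setup: n x n grid, obstacle density rho,
    error probability delta, start (24, 24), goal (0, alpha)."""
    rng = np.random.default_rng(seed)
    obstacles = rng.random((n, n)) < rho           # obstacle-to-terrain ratio rho
    start = (n - 1, n - 1)
    goal = (0, int(rng.integers(0, n)))            # alpha ~ uniform over [0, 24]
    obstacles[start] = obstacles[goal] = False     # keep endpoints traversable (assumed)
    return obstacles, start, goal, delta, horizon

def step(pos, action, obstacles, delta, rng):
    """Move one cell; with probability delta take a random move instead.
    Constraint cost (assumed) is 1 when stepping onto an obstacle cell."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
    if rng.random() < delta:                       # error probability delta
        action = int(rng.integers(4))
    n = obstacles.shape[0]
    r = min(max(pos[0] + moves[action][0], 0), n - 1)   # clamp at the border (assumed)
    c = min(max(pos[1] + moves[action][1], 0), n - 1)
    return (r, c), float(obstacles[r, c])          # next state, constraint cost
```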
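
The two observation variants quoted in the Research Type row (a one-hot encoding of the agent's location, and a 2D image representation of the grid map) could be encoded as below; the three-channel image layout (obstacles, agent, goal) is an assumption for illustration, not the paper's specification.

```python
import numpy as np

def one_hot_obs(pos, n=25):
    """One-hot encoding of the agent's location on an n x n grid (625 states)."""
    v = np.zeros(n * n)
    v[pos[0] * n + pos[1]] = 1.0
    return v

def image_obs(pos, obstacles, goal):
    """2D image representation of the grid map; channel layout is assumed:
    channel 0 = obstacles, channel 1 = agent, channel 2 = goal."""
    n = obstacles.shape[0]
    img = np.zeros((3, n, n), dtype=np.float32)
    img[0] = obstacles
    img[1, pos[0], pos[1]] = 1.0
    img[2, goal[0], goal[1]] = 1.0
    return img
```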
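
Algorithm 1 (SPI) restricts each policy-improvement step to a Lyapunov-induced set of safe policies. Below is a minimal tabular sketch of that idea for a discounted CMDP, assuming a transition tensor P of shape (A, S, S) and cost matrices c (objective) and d (constraint) of shape (S, A). The constant-slack Lyapunov function L = D + (d0 - D(x0)) is a simplification of the paper's auxiliary-cost construction, and all names are illustrative.

```python
import numpy as np

def policy_eval(P, cost, pi, gamma):
    """V_pi via linear solve; pi is a deterministic policy (action index per state)."""
    S = cost.shape[0]
    P_pi = np.array([P[pi[s], s] for s in range(S)])   # row s: P(. | s, pi(s))
    c_pi = cost[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)

def safe_policy_iteration(P, c, d, d0, x0, pi_b, gamma=0.95, iters=50):
    """Sketch of Lyapunov-constrained policy improvement (simplified from SPI):
    build L from the current policy's constraint values, then improve the
    policy only over actions whose one-step Lyapunov backup stays below L."""
    S, A = c.shape
    pi = pi_b.copy()                                   # start from the baseline policy
    for _ in range(iters):
        V = policy_eval(P, c, pi, gamma)               # objective value of current policy
        D = policy_eval(P, d, pi, gamma)               # constraint value of current policy
        L = D + max(0.0, d0 - D[x0])                   # constant-slack Lyapunov function
        Qc = c + gamma * np.einsum('asx,x->sa', P, V)  # objective Q-values
        TL = d + gamma * np.einsum('asx,x->sa', P, L)  # one-step Lyapunov backup
        new_pi = pi.copy()
        for s in range(S):
            safe = np.flatnonzero(TL[s] <= L[s] + 1e-9)   # Lyapunov-safe actions
            if safe.size:                                  # greedy among safe actions
                new_pi[s] = safe[np.argmin(Qc[s, safe])]
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi
```

Because the current policy always satisfies its own Lyapunov backup (the slack is non-negative and γ < 1), the safe action set is never empty, so each improvement step in this sketch is well defined.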