A Lyapunov-based Approach to Safe Reinforcement Learning

Authors: Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance. We evaluate their learning performance on two variants: one in which the observation is a one-hot encoding of the agent's location, and the other in which the observation is the 2D image representation of the grid map. In each of these, we evaluate performance when d0 = 1 and d0 = 5.
Researcher Affiliation | Industry | Yinlam Chow, DeepMind, yinlamchow@google.com; Ofir Nachum, Google Brain, ofirnachum@google.com; Edgar Duenez-Guzman, DeepMind, duenez@google.com; Mohammad Ghavamzadeh, Facebook AI Research, mgh@fb.com
Pseudocode | Yes | Algorithm 1: Safe Policy Iteration (SPI); Algorithm 2: Safe Value Iteration (SVI)
Open Source Code | No | No explicit statement or link providing access to the authors' source code was found.
Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the "stochastic 2D grid-world motion planning problem" dataset, which appears to be custom-generated.
Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments are provided in the paper.
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies or libraries used in the experiments.
Experiment Setup | Yes | For demonstration purposes, we choose a 25 × 25 grid-world (see Figure 1) with a total of 625 states. We also have a density ratio ρ ∈ (0, 1) that sets the obstacle-to-terrain ratio. When ρ is close to 0, the problem is obstacle-free, and if ρ is close to 1, then the problem becomes more challenging. In the normal problem setting, we choose a density ρ = 0.3, an error probability δ = 0.05, a constraint threshold d0 = 5, and a maximum horizon of 200 steps. The initial state is located in (24, 24) and the goal is placed in (0, α), where α ∈ [0, 24] is a uniform random variable. To account for statistical significance, the results of each experiment are averaged over 20 trials.
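
Since the benchmark environment is custom-generated rather than released, the sketch below shows one way such a grid-world could be instantiated. It is not the authors' code: the helper names (make_grid, step, one_hot_obs, image_obs) and the constraint-cost bookkeeping are assumptions; only the numeric settings (25 × 25 grid, ρ = 0.3, δ = 0.05, d0 = 5, 200-step horizon, start at (24, 24), goal at (0, α)) and the two observation variants come from the quoted text.

```python
# A minimal sketch (not the authors' code) of the stochastic 2D grid-world
# described in the Experiment Setup row. Helper names and structure are
# hypothetical; only the numeric settings come from the paper's description.
import numpy as np

GRID_SIZE = 25          # 25 x 25 grid -> 625 states
RHO = 0.3               # obstacle-to-terrain density ratio
DELTA = 0.05            # error probability: chance of a random move
D0 = 5                  # constraint threshold
MAX_HORIZON = 200       # maximum episode length

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def make_grid(rng):
    """Sample an obstacle map with density RHO, keeping start and goal free."""
    obstacles = rng.random((GRID_SIZE, GRID_SIZE)) < RHO
    start = (24, 24)
    goal = (0, rng.integers(0, GRID_SIZE))    # goal at (0, alpha), alpha uniform
    obstacles[start] = False
    obstacles[goal] = False
    return obstacles, start, goal

def step(state, action, obstacles, rng):
    """With probability DELTA the agent moves in a random direction instead."""
    if rng.random() < DELTA:
        action = ACTIONS[rng.integers(len(ACTIONS))]
    r = min(max(state[0] + action[0], 0), GRID_SIZE - 1)
    c = min(max(state[1] + action[1], 0), GRID_SIZE - 1)
    next_state = (r, c)
    constraint_cost = 1.0 if obstacles[next_state] else 0.0  # assumed cost model
    return next_state, constraint_cost

def one_hot_obs(state):
    """Variant 1: one-hot encoding of the agent's location (625-dim vector)."""
    obs = np.zeros(GRID_SIZE * GRID_SIZE)
    obs[state[0] * GRID_SIZE + state[1]] = 1.0
    return obs

def image_obs(state, obstacles, goal):
    """Variant 2: 2D image of the grid map (obstacle, agent, goal channels)."""
    obs = np.zeros((3, GRID_SIZE, GRID_SIZE))
    obs[0] = obstacles.astype(float)
    obs[1][state] = 1.0
    obs[2][goal] = 1.0
    return obs

# Usage: roll out one episode with a placeholder policy.
rng = np.random.default_rng(0)
obstacles, start, goal = make_grid(rng)
state, total_cost = start, 0.0
for _ in range(MAX_HORIZON):
    state, c = step(state, ACTIONS[0], obstacles, rng)   # placeholder policy
    total_cost += c
    if state == goal:
        break
```

The per-step constraint cost here is an indicator of entering an obstacle cell, which is one plausible reading of how the threshold d0 bounds constraint violations per episode; the paper's exact cost definition should be checked against the original text.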