A Lyapunov-based Approach to Safe Reinforcement Learning

Authors: Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance. We evaluate their learning performance on two variants: one in which the observation is a one-hot encoding of the agent's location, and the other in which the observation is the 2D image representation of the grid map. In each of these, we evaluate performance when d0 = 1 and d0 = 5.
Researcher Affiliation | Industry | Yinlam Chow, DeepMind, yinlamchow@google.com; Ofir Nachum, Google Brain, ofirnachum@google.com; Edgar Duenez-Guzman, DeepMind, duenez@google.com; Mohammad Ghavamzadeh, Facebook AI Research, mgh@fb.com
Pseudocode | Yes | Algorithm 1: Safe Policy Iteration (SPI); Algorithm 2: Safe Value Iteration (SVI)
Open Source Code | No | No explicit statement or link providing access to the authors' source code was found.
Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the "stochastic 2D grid-world motion planning problem" dataset, which appears to be custom-generated.
Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments are provided in the paper.
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies or libraries used in the experiments.
Experiment Setup | Yes | For demonstration purposes, we choose a 25 × 25 grid-world (see Figure 1) with a total of 625 states. We also have a density ratio ρ ∈ (0, 1) that sets the obstacle-to-terrain ratio. When ρ is close to 0, the problem is obstacle-free, and if ρ is close to 1, then the problem becomes more challenging. In the normal problem setting, we choose a density ρ = 0.3, an error probability δ = 0.05, a constraint threshold d0 = 5, and a maximum horizon of 200 steps. The initial state is located in (24, 24) and the goal is placed in (0, α), where α ∈ [0, 24] is a uniform random variable. To account for statistical significance, the results of each experiment are averaged over 20 trials.
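
Since the benchmark environment is custom-generated rather than released, the sketch below shows one way such a grid-world could be instantiated. It is not the authors' code: the helper names (make_grid, step, one_hot_obs, image_obs) and the constraint-cost bookkeeping are assumptions; only the numeric settings (25 × 25 grid, ρ = 0.3, δ = 0.05, d0 = 5, 200-step horizon, start at (24, 24), goal at (0, α)) and the two observation variants come from the quoted text.

```python
# A minimal sketch (not the authors' code) of the stochastic 2D grid-world
# described in the Experiment Setup row. Helper names and structure are
# hypothetical; only the numeric settings come from the paper's description.
import numpy as np

GRID_SIZE = 25          # 25 x 25 grid -> 625 states
RHO = 0.3               # obstacle-to-terrain density ratio
DELTA = 0.05            # error probability: chance of a random move
D0 = 5                  # constraint threshold
MAX_HORIZON = 200       # maximum episode length

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def make_grid(rng):
    """Sample an obstacle map with density RHO, keeping start and goal free."""
    obstacles = rng.random((GRID_SIZE, GRID_SIZE)) < RHO
    start = (24, 24)
    goal = (0, rng.integers(0, GRID_SIZE))    # goal at (0, alpha), alpha uniform
    obstacles[start] = False
    obstacles[goal] = False
    return obstacles, start, goal

def step(state, action, obstacles, rng):
    """With probability DELTA the agent moves in a random direction instead."""
    if rng.random() < DELTA:
        action = ACTIONS[rng.integers(len(ACTIONS))]
    r = min(max(state[0] + action[0], 0), GRID_SIZE - 1)
    c = min(max(state[1] + action[1], 0), GRID_SIZE - 1)
    next_state = (r, c)
    constraint_cost = 1.0 if obstacles[next_state] else 0.0  # assumed cost model
    return next_state, constraint_cost

def one_hot_obs(state):
    """Variant 1: one-hot encoding of the agent's location (625-dim vector)."""
    obs = np.zeros(GRID_SIZE * GRID_SIZE)
    obs[state[0] * GRID_SIZE + state[1]] = 1.0
    return obs

def image_obs(state, obstacles, goal):
    """Variant 2: 2D image of the grid map (obstacle, agent, goal channels)."""
    obs = np.zeros((3, GRID_SIZE, GRID_SIZE))
    obs[0] = obstacles.astype(float)
    obs[1][state] = 1.0
    obs[2][goal] = 1.0
    return obs

# Usage: roll out one episode with a placeholder policy.
rng = np.random.default_rng(0)
obstacles, start, goal = make_grid(rng)
state, total_cost = start, 0.0
for _ in range(MAX_HORIZON):
    state, c = step(state, ACTIONS[0], obstacles, rng)   # placeholder policy
    total_cost += c
    if state == goal:
        break
```

The per-step constraint cost here is an indicator of entering an obstacle cell, which is one plausible reading of how the threshold d0 bounds constraint violations per episode; the paper's exact cost definition should be checked against the original text.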