Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Lyapunov-based Approach to Safe Reinforcement Learning

Authors: Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

NeurIPS 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance. We evaluate their learning performance on two variants: one in which the observation is a one-hot encoding of the agent's location, and the other in which the observation is the 2D image representation of the grid map. In each of these, we evaluate performance when d0 = 1 and d0 = 5. |
| Researcher Affiliation | Industry | Yinlam Chow (DeepMind, EMAIL); Ofir Nachum (Google Brain, EMAIL); Edgar Duenez-Guzman (DeepMind, EMAIL); Mohammad Ghavamzadeh (Facebook AI Research, EMAIL) |
| Pseudocode | Yes | Algorithm 1: Safe Policy Iteration (SPI); Algorithm 2: Safe Value Iteration (SVI) |
| Open Source Code | No | No explicit statement or link providing access to the authors' source code was found. |
| Open Datasets | No | No concrete access information (link, DOI, repository, or formal citation with authors/year) is provided for the "stochastic 2D grid-world motion planning problem" dataset, which appears to be custom-generated. |
| Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | For demonstration purposes, we choose a 25 × 25 grid-world (see Figure 1) with a total of 625 states. We also have a density ratio ρ ∈ (0, 1) that sets the obstacle-to-terrain ratio. When ρ is close to 0, the problem is obstacle-free, and if ρ is close to 1, then the problem becomes more challenging. In the normal problem setting, we choose a density ρ = 0.3, an error probability δ = 0.05, a constraint threshold d0 = 5, and a maximum horizon of 200 steps. The initial state is located in (24, 24) and the goal is placed in (0, α), where α ∈ [0, 24] is a uniform random variable. To account for statistical significance, the results of each experiment are averaged over 20 trials. |
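
Because no source code is available, the sketch below shows one way the quoted experiment setup could be written down as a configuration. It is not the authors' implementation. The names `GridWorldConfig` and `sample_instance` are hypothetical, and two points are assumptions that the quoted text does not settle: obstacles are drawn independently per cell with probability ρ, and α is sampled as an integer column index in [0, 24].

```python
import random
from dataclasses import dataclass
from typing import Set, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class GridWorldConfig:
    """Values quoted in the Experiment Setup row (hypothetical container)."""
    size: int = 25                      # 25 x 25 grid, 625 states
    density: float = 0.3                # rho, obstacle-to-terrain ratio
    error_prob: float = 0.05            # delta; its effect on transitions is not specified here
    constraint_threshold: float = 5.0   # d0
    max_horizon: int = 200              # maximum number of steps per episode
    start: Cell = (24, 24)              # initial state
    num_trials: int = 20                # results are averaged over 20 trials


def sample_instance(cfg: GridWorldConfig, rng: random.Random) -> Tuple[Cell, Set[Cell]]:
    """Sample a goal (0, alpha) and an obstacle set for one trial.

    Assumption: alpha is an integer drawn uniformly from [0, size - 1], and
    each cell other than the start and goal is an obstacle with probability rho.
    """
    goal = (0, rng.randint(0, cfg.size - 1))  # randint is inclusive on both ends
    obstacles: Set[Cell] = set()
    for row in range(cfg.size):
        for col in range(cfg.size):
            cell = (row, col)
            if cell == cfg.start or cell == goal:
                continue  # keep the start and goal cells free
            if rng.random() < cfg.density:
                obstacles.add(cell)
    return goal, obstacles


if __name__ == "__main__":
    cfg = GridWorldConfig()
    rng = random.Random(0)  # fixed seed so the sampled maps are repeatable
    for trial in range(cfg.num_trials):
        goal, obstacles = sample_instance(cfg, rng)
        print(f"trial {trial}: goal={goal}, obstacles={len(obstacles)}")
```

Seeding the generator makes the 20 sampled maps repeatable, which partly compensates for the missing dataset and split information noted in the table.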