Provably Safe Reinforcement Learning with Step-wise Violation Constraints

Authors: Nuoya Xiong, Yihan Du, Longbo Huang

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results." |
| Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University; University of Illinois at Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1 (SUCBVI); Algorithm 2 (SRF-UCRL) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using "a custom MDP environment" and "a grid world environment" for its experiments, but provides no links, DOIs, or formal citations that would make these environments publicly accessible. |
| Dataset Splits | No | The paper does not specify exact percentages or absolute sample counts for training, validation, or test splits; it mentions total steps/episodes but no data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, library versions). |
| Experiment Setup | Yes | For the Safe-RL-SW experiment, the number of total steps is T = 500000. In each episode, the agent receives reward 10 if it reaches the goal state at the end, and 0 otherwise. For all states except the goal state, c(s) = 0; for the goal state, c(goal) = 0.5. There are two unsafe states with c(unsafe1) = 0.6 and c(unsafe2) = 0.7, and the safety threshold is τ = 0.5. SUCBVI uses confidence level δ = 0.05, and the baselines UCBVI, Opt CMDP-bonus, Triple-Q, and OptPess each use parameter 0.05; the learning rates of these algorithms are also tuned. For the Safe-RFE-SW experiment, the total number of episodes is K = 50000, with ε = 0.1 and δ = 0.05, in a grid world environment with 25 states, 4 actions, and a horizon of 10; it contains one goal state with c(goal) = 0.5, one unsafe state with c(unsafe) = 0.6, and safety threshold τ = 0.5. |
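Since no official code is released, the sketch below simply collects the reported hyperparameters into two configuration objects, one per experiment. The class and field names (SafeRLSWConfig, SafeRFESWConfig, state_costs, and so on) are hypothetical and not taken from the paper; only the numeric values quoted above come from it.

```python
# Minimal sketch (not the authors' released code; none is available) of the two
# experiment configurations described in the Experiment Setup row above.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SafeRLSWConfig:
    """Safe-RL-SW experiment on the custom MDP (values quoted from the paper)."""
    total_steps: int = 500_000
    goal_reward: float = 10.0          # reward for reaching the goal state at episode end
    default_cost: float = 0.0          # c(s) for every state not listed in state_costs
    state_costs: Dict[str, float] = field(default_factory=lambda: {
        "goal": 0.5, "unsafe1": 0.6, "unsafe2": 0.7,
    })
    safety_threshold: float = 0.5      # tau
    confidence_delta: float = 0.05     # delta for SUCBVI; baselines use the same value


@dataclass
class SafeRFESWConfig:
    """Safe-RFE-SW experiment on the 25-state grid world (values quoted from the paper)."""
    num_episodes: int = 50_000
    num_states: int = 25
    num_actions: int = 4
    horizon: int = 10
    epsilon: float = 0.1
    confidence_delta: float = 0.05
    state_costs: Dict[str, float] = field(default_factory=lambda: {
        "goal": 0.5, "unsafe": 0.6,
    })
    safety_threshold: float = 0.5
```

Such a configuration could then be fed to whatever implementation of SUCBVI or SRF-UCRL one writes; the algorithms themselves are given only as pseudocode (Algorithms 1 and 2) in the paper.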