Provably Safe Reinforcement Learning with Step-wise Violation Constraints

Authors: Nuoya Xiong, Yihan Du, Longbo Huang

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results." |
| Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University; University of Illinois at Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1 (SUCBVI); Algorithm 2 (SRF-UCRL) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using "a custom MDP environment" and "a grid world environment" for its experiments, but provides no links, DOIs, or formal citations that would make these environments publicly accessible. |
| Dataset Splits | No | The paper does not specify exact percentages or absolute sample counts for training, validation, or test splits; it mentions total steps/episodes but no data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, library versions). |
| Experiment Setup | Yes | For the Safe-RL-SW experiment, the number of total steps is T = 500000. In each episode, the agent receives reward 10 if it reaches the goal state at the end, and 0 otherwise. For all states except the goal state, c(s) = 0; for the goal state, c(goal) = 0.5. There are two unsafe states with c(unsafe1) = 0.6 and c(unsafe2) = 0.7, and the safety threshold is τ = 0.5. SUCBVI uses confidence level δ = 0.05, and the baselines UCBVI, Opt CMDP-bonus, Triple-Q, and OptPess each use parameter 0.05; the learning rates of these algorithms are also tuned. For the Safe-RFE-SW experiment, the total number of episodes is K = 50000, with ε = 0.1 and δ = 0.05, in a grid world environment with 25 states, 4 actions, and a horizon of 10; it contains one goal state with c(goal) = 0.5, one unsafe state with c(unsafe) = 0.6, and safety threshold τ = 0.5. |
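Since no official code is released, the sketch below simply collects the reported hyperparameters into two configuration objects, one per experiment. The class and field names (SafeRLSWConfig, SafeRFESWConfig, state_costs, and so on) are hypothetical and not taken from the paper; only the numeric values quoted above come from it.

```python
# Minimal sketch (not the authors' released code; none is available) of the two
# experiment configurations described in the Experiment Setup row above.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SafeRLSWConfig:
    """Safe-RL-SW experiment on the custom MDP (values quoted from the paper)."""
    total_steps: int = 500_000
    goal_reward: float = 10.0          # reward for reaching the goal state at episode end
    default_cost: float = 0.0          # c(s) for every state not listed in state_costs
    state_costs: Dict[str, float] = field(default_factory=lambda: {
        "goal": 0.5, "unsafe1": 0.6, "unsafe2": 0.7,
    })
    safety_threshold: float = 0.5      # tau
    confidence_delta: float = 0.05     # delta for SUCBVI; baselines use the same value


@dataclass
class SafeRFESWConfig:
    """Safe-RFE-SW experiment on the 25-state grid world (values quoted from the paper)."""
    num_episodes: int = 50_000
    num_states: int = 25
    num_actions: int = 4
    horizon: int = 10
    epsilon: float = 0.1
    confidence_delta: float = 0.05
    state_costs: Dict[str, float] = field(default_factory=lambda: {
        "goal": 0.5, "unsafe": 0.6,
    })
    safety_threshold: float = 0.5
```

Such a configuration could then be fed to whatever implementation of SUCBVI or SRF-UCRL one writes; the algorithms themselves are given only as pseudocode (Algorithms 1 and 2) in the paper.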