Provably Safe Reinforcement Learning with Step-wise Violation Constraints
Authors: Nuoya Xiong, Yihan Du, Longbo Huang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results. |
| Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) University of Illinois at Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1: SUCBVI; Algorithm 2: SRF-UCRL |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using 'a custom MDP environment' and 'a grid world environment' for experiments, but provides no links, DOIs, or formal citations to make these datasets/environments publicly accessible. |
| Dataset Splits | No | The paper does not specify exact percentages or absolute sample counts for training, validation, or test splits. It mentions total steps/episodes but no data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, library versions). |
| Experiment Setup | Yes | For the Safe-RL-SW experiment, the total number of steps is T = 500000. In each episode, the agent receives reward 10 if it reaches the goal state at the end, and 0 otherwise. For all states except the goal state, c(s) = 0; for the goal state, c(goal) = 0.5. There are two unsafe states with c(unsafe1) = 0.6 and c(unsafe2) = 0.7, and the safety threshold is τ = 0.5. For SUCBVI, the confidence level is δ = 0.05; the baselines UCBVI, OptCMDP-bonus, Triple-Q, and OptPess each use a parameter of 0.05, with their learning rates tuned. For the Safe-RFE-SW experiment, the total number of episodes is K = 50000, with ε = 0.1 and δ = 0.05, in a grid-world environment with 25 states, 4 actions, and horizon 10. It has one goal state with c(goal) = 0.5 and one unsafe state with c(unsafe) = 0.6; the safety threshold is again τ = 0.5 (a minimal environment sketch follows the table). |
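
For orientation, the grid-world setup quoted in the Experiment Setup row can be mocked up as a small environment. The sketch below is a minimal illustration, not the authors' implementation: it assumes a 5x5 layout, deterministic transitions, and arbitrary positions for the goal and unsafe cells (`GOAL`, `UNSAFE`), none of which the report specifies. Only the sizes (25 states, 4 actions, horizon 10), the costs c(goal) = 0.5 and c(unsafe) = 0.6, the threshold τ = 0.5, and the terminal reward of 10 come from the quoted setup.

```python
import numpy as np

# Minimal sketch of the grid-world environment described above.
# Assumed (not stated in the report): a 5x5 layout, deterministic
# moves, and the placement of the goal and unsafe cells.

N_STATES, N_ACTIONS, HORIZON = 25, 4, 10
GOAL, UNSAFE = 24, 12          # hypothetical cell indices
TAU = 0.5                      # safety threshold

# Step-wise violation cost c(s): the step-wise constraint requires
# c(s_h) <= TAU at every step h, so visiting UNSAFE (c = 0.6) violates it.
cost = np.zeros(N_STATES)
cost[GOAL] = 0.5
cost[UNSAFE] = 0.6

def step(state: int, action: int) -> int:
    """Deterministic move on the 5x5 grid; actions are up/down/left/right."""
    row, col = divmod(state, 5)
    if action == 0:
        row = max(row - 1, 0)
    elif action == 1:
        row = min(row + 1, 4)
    elif action == 2:
        col = max(col - 1, 0)
    else:
        col = min(col + 1, 4)
    return row * 5 + col

def run_episode(policy) -> tuple[float, int]:
    """Roll out one episode; reward 10 only if the goal is reached at the end."""
    state, violations = 0, 0
    for h in range(HORIZON):
        state = step(state, policy(state, h))
        violations += int(cost[state] > TAU)   # count step-wise violations
    reward = 10.0 if state == GOAL else 0.0
    return reward, violations
```

A random policy, for example `run_episode(lambda s, h: np.random.randint(N_ACTIONS))`, returns the episode reward together with the number of step-wise safety violations, which is the quantity the paper's Safe-RFE-SW experiment tracks against the threshold τ.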