Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations
Authors: Yuping Luo, Tengyu Ma
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical simulations show that zero safety violations are already challenging for a suite of simple environments with only 2-4 dimensional state space, especially if high-reward policies have to visit regions near the safety boundary. Prior methods require hundreds of violations to achieve decent rewards on these tasks, whereas our proposed algorithms incur zero violations. |
| Researcher Affiliation | Academia | Yuping Luo Princeton University yupingl@cs.princeton.edu; Tengyu Ma Stanford University tengyuma@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Learning barrier certificate h_φ for a policy π w.r.t. a calibrated dynamics model T̂; Algorithm 2 CRABS: Co-trained Barrier Certificate for Safe RL (Details in Section 4); Algorithm 3 Safe exploration with safeguard policy πsafeguard (a generic, illustrative sketch of certificate learning follows the table) |
| Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | The task is based on Pendulum-v0 in OpenAI Gym [Brockman et al., 2016], as shown in Figure 1a. The task is based on a cart pole and the goal is to move a cart (the yellow block) to control the pole (with color teal), as shown in Figure 1b. |
| Dataset Splits | No | The paper does not explicitly provide training, validation, or test dataset splits. It mentions using OpenAI Gym environments but does not quantify or define the data splits used for the experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like PyTorch and OpenAI Gym, but does not provide specific version numbers for any of its software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | For all the tasks, once the safety constraint is violated, the episode will terminate immediately and the agent will receive a reward of -30 as a penalty. The number -30 is tuned by running SAC and choosing the one that SAC performs best with. For SAC, we use the default hyperparameters because we found they are not sensitive. For Recovery RL and SQRL, the hyperparameters are tuned in the same way as in Thananjeyan et al. [2021]. For CPO, we tune the step size and batch size. More details of experiment setup and the implementation of baselines can be found in Appendix C. (an illustrative sketch of this setup follows the table) |
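
The Pseudocode row above lists Algorithm 1 only by title, and the paper's actual procedure is not reproduced in this report. As a point of reference, the following is a generic sketch of what learning a barrier certificate h_φ against a learned dynamics model could look like: unsafe states are pushed out of the certified set, and certified-safe states are penalized if their model-predicted successors leave the certified set. Every name here (`BarrierNet`, `dynamics_model`, `unsafe_states`, the sign convention h_φ(s) ≥ 0 meaning "certified safe") is a hypothetical placeholder; this is not the authors' Algorithm 1 or released code.

```python
import torch
import torch.nn as nn


class BarrierNet(nn.Module):
    """Hypothetical certificate network h_phi(s); h_phi(s) >= 0 is read as
    'certified safe'. A generic sketch, not the paper's Algorithm 1."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)


def certificate_loss(h, policy, dynamics_model, safe_states, unsafe_states):
    """Generic barrier-style loss:
    - unsafe states must receive h < 0;
    - states certified safe (h >= 0) must have model-predicted successors
      that remain certified safe (forward invariance under the policy)."""
    # Unsafe states should be excluded from the certified set.
    loss_unsafe = torch.relu(h(unsafe_states)).mean()

    # Forward invariance on the batch of states currently certified safe.
    next_states = dynamics_model(safe_states, policy(safe_states))
    in_certified_set = (h(safe_states) >= 0).float()
    loss_invariance = (in_certified_set * torch.relu(-h(next_states))).mean()

    return loss_unsafe + loss_invariance
```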
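
To make the Experiment Setup row concrete, below is a minimal sketch of how the reported safety handling (immediate episode termination plus a -30 reward penalty on violation) could be implemented as an OpenAI Gym wrapper. Only the -30 penalty and immediate termination come from the paper; the wrapper class, the `is_unsafe` predicate, and the Pendulum angle threshold are illustrative assumptions, not the authors' code (none is linked from the paper).

```python
import gym
import numpy as np


class SafetyPenaltyWrapper(gym.Wrapper):
    """Terminate the episode and apply a fixed penalty when the safety
    constraint is violated, as described in the paper's experiment setup.

    `is_unsafe` is a hypothetical task-specific predicate on the observation;
    the -30 penalty is the value reported in the paper (tuned for SAC)."""

    def __init__(self, env, is_unsafe, violation_penalty=-30.0):
        super().__init__(env)
        self.is_unsafe = is_unsafe
        self.violation_penalty = violation_penalty

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.is_unsafe(obs):
            # Safety violation: end the episode immediately with the penalty.
            reward = self.violation_penalty
            done = True
            info["safety_violation"] = True
        return obs, reward, done, info


# Example usage on Pendulum-v0 (the task in Figure 1a), with a hypothetical
# angle-based safety predicate used purely for illustration.
def pendulum_unsafe(obs, max_abs_angle=np.pi / 2):
    cos_theta, sin_theta, _ = obs
    return abs(np.arctan2(sin_theta, cos_theta)) > max_abs_angle


env = SafetyPenaltyWrapper(gym.make("Pendulum-v0"), pendulum_unsafe)
```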