Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations

Authors: Yuping Luo, Tengyu Ma

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical simulations show that zero safety violations are already challenging for a suite of simple environments with only 2-4 dimensional state space, especially if high-reward policies have to visit regions near the safety boundary. Prior methods require hundreds of violations to achieve decent rewards on these tasks, whereas our proposed algorithms incur zero violations.
Researcher Affiliation | Academia | Yuping Luo, Princeton University, yupingl@cs.princeton.edu; Tengyu Ma, Stanford University, tengyuma@stanford.edu
Pseudocode | Yes | Algorithm 1: Learning barrier certificate h_φ for a policy π w.r.t. a calibrated dynamics model T̂; Algorithm 2: CRABS: Co-trained Barrier Certificate for Safe RL (details in Section 4); Algorithm 3: Safe exploration with safeguard policy π_safeguard. (An illustrative sketch of the safeguarded exploration step appears after this table.)
Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | The first task is based on Pendulum-v0 in OpenAI Gym [Brockman et al., 2016], as shown in Figure 1a. The other task is based on a cart pole, and the goal is to move the cart (the yellow block) to control the pole (colored teal), as shown in Figure 1b.
Dataset Splits | No | The paper does not explicitly provide training, validation, or test dataset splits. It mentions using OpenAI Gym environments but does not quantify or define the data splits used for the experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software such as PyTorch and OpenAI Gym but does not provide specific version numbers for its software dependencies, as required for reproducibility.
Experiment Setup | Yes | For all the tasks, once the safety constraint is violated, the episode will terminate immediately and the agent will receive a reward of -30 as a penalty. The number -30 is tuned by running SAC and choosing the one that SAC performs best with. For SAC, we use the default hyperparameters because we found they are not sensitive. For Recovery RL and SQRL, the hyperparameters are tuned in the same way as in Thananjeyan et al. [2021]. For CPO, we tune the step size and batch size. More details of the experiment setup and the implementation of baselines can be found in Appendix C.
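
To make the violation handling described in the Experiment Setup row concrete, here is a minimal sketch of a Gym wrapper that terminates the episode and assigns the -30 penalty when the safety constraint is violated. The wrapper class, its name, and the `is_unsafe` predicate are hypothetical illustrations (the paper releases no code); only the immediate termination and the -30 penalty value come from the paper.

```python
import gym


class SafetyPenaltyWrapper(gym.Wrapper):
    """Terminate the episode and apply a fixed penalty on safety violation.

    `is_unsafe` is a placeholder predicate supplied by the user; the -30
    default mirrors the penalty reported in the Experiment Setup row.
    """

    def __init__(self, env, is_unsafe, penalty=-30.0):
        super().__init__(env)
        self.is_unsafe = is_unsafe
        self.penalty = penalty

    def step(self, action):
        # Classic 4-tuple Gym API, as used by Pendulum-v0.
        obs, reward, done, info = self.env.step(action)
        if self.is_unsafe(obs):
            reward = self.penalty  # agent receives -30 as the violation penalty
            done = True            # episode terminates immediately on violation
            info["safety_violation"] = True
        return obs, reward, done, info
```

For instance, the Pendulum-v0 task from the Open Datasets row could be wrapped as `SafetyPenaltyWrapper(gym.make("Pendulum-v0"), is_unsafe=my_constraint)`, with `my_constraint` encoding the task's safety set.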
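
The Pseudocode row lists Algorithm 3, safe exploration with a safeguard policy π_safeguard. The sketch below is only a rough illustration of that idea, not the authors' implementation: it keeps the exploration policy's action when a learned certificate h_φ deems the model's predicted next states safe, and otherwise falls back to the safeguard policy. The function names, the sign convention h_φ(s) ≤ 0 for the certified set, and the sampled-next-state check are all assumptions made for this example.

```python
import torch


def safeguarded_step(state, pi_explore, pi_safeguard, h_phi, sample_next_states):
    """Return the exploration action only if every sampled next state predicted
    by the calibrated dynamics model lies in the certified set {s : h_phi(s) <= 0};
    otherwise fall back to the safeguard policy's action.

    All arguments are placeholder callables; this is an illustrative sketch,
    not the paper's released implementation (none is available).
    """
    candidate = pi_explore(state)                       # proposed exploratory action
    next_states = sample_next_states(state, candidate)  # (n_samples, state_dim) tensor
    if torch.all(h_phi(next_states) <= 0):              # all predictions certified safe
        return candidate
    return pi_safeguard(state)                          # conservative fallback
```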