Verified Safe Reinforcement Learning for Neural Network Dynamic Models
Authors: Junlin Wu, Huan Zhang, Yevgeniy Vorobeychik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on five safe control problems demonstrate that our trained controllers can achieve verified safety over horizons that are as much as an order of magnitude longer than state-of-the-art baselines, while maintaining high reward, as well as a perfect safety record over entire episodes. |
| Researcher Affiliation | Academia | Junlin Wu (Computer Science & Engineering, Washington University in St. Louis, junlin.wu@wustl.edu); Huan Zhang (Electrical & Computer Engineering, University of Illinois Urbana-Champaign, huan@huan-zhang.com); Yevgeniy Vorobeychik (Computer Science & Engineering, Washington University in St. Louis, yvorobeychik@wustl.edu) |
| Pseudocode | Yes | Algorithm 1 (Curriculum Learning with Memorization) and Algorithm 2 (Initial-State-Dependent Controller) are presented in the paper (an illustrative curriculum skeleton follows the table). |
| Open Source Code | Yes | Our code is available at https://github.com/jlwu002/VSRL. |
| Open Datasets | No | The paper describes the environment dynamics and initial state regions (S0) for its simulations (e.g., "Our lane following environment follows the discrete-time bicycle model"), but it does not specify or link to a publicly available or open dataset in the conventional sense (e.g., a named dataset with a specific access URL or citation). |
| Dataset Splits | No | The paper describes how initial regions S0 are split into a grid G0 for verification and how empirical safety is evaluated by sampling datapoints from S0, but it does not specify explicit training/validation/test dataset splits in percentages or counts for model training. |
| Hardware Specification | Yes | Our code runs on an AMD Ryzen 9 5900X CPU with a 12-core processor and an NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using the "α,β-CROWN toolbox," which is "implemented with the auto_LiRPA package," but it does not provide specific version numbers for these software dependencies (a minimal bound-computation sketch follows the table). |
| Experiment Setup | Yes | We evaluate our proposed approach in five control settings: Lane Following, Vehicle Avoidance, 2D Quadrotor (with both fixed and moving obstacles), and 3D Quadrotor [Kong et al., 2015, Dai et al., 2021]. The dynamics of these environments are approximated using NNs with ReLU activations, and we use a continuous action space for these discrete-time systems. In each experiment, we specify the initial region S0 for which we wish to achieve verified safety and then aim for the maximum horizon K over which safety can be verified. Our controller is initialized from a controller pretrained with a safe RL algorithm. When training with the bound loss, we add a large penalty on unsafe states to incentivize maintaining safety throughout the entire trajectory. For branch and bound during verification, we set the precision limit to 0.025: as soon as a grid region reaches this precision, we stop branching. For the dynamics approximation, we use an NN with two ReLU layers, each of size 8. The parameters are set as dmax = 0.7, θmax = π/4, and vmax = 5.0, and the initial region S0 is x ∈ [−0.5, 0.5], θ ∈ [−0.2, 0.2], v ∈ [0.0, 0.5] (a sketch of this setup follows the table). |
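
To make the reported setup concrete, here is a minimal PyTorch sketch of the described dynamics approximator (two ReLU layers of size 8) and the initial region S0 for the lane-following setting. The state/action dimensions and the `sample_s0` helper are our assumptions for illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn

# Two hidden ReLU layers of size 8, as described for the dynamics
# approximation. STATE_DIM and ACTION_DIM are assumptions for the
# lane-following setting (state = [x, theta, v], scalar action).
STATE_DIM, ACTION_DIM, HIDDEN = 3, 1, 8

dynamics = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, STATE_DIM),  # predicts the next state
)

# Initial region S0: x in [-0.5, 0.5], theta in [-0.2, 0.2], v in [0.0, 0.5].
S0_LOW = torch.tensor([-0.5, -0.2, 0.0])
S0_HIGH = torch.tensor([0.5, 0.2, 0.5])

def sample_s0(n: int) -> torch.Tensor:
    """Uniformly sample initial states from S0 for empirical safety checks."""
    return S0_LOW + (S0_HIGH - S0_LOW) * torch.rand(n, STATE_DIM)
```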
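Since the paper reports using the α,β-CROWN toolbox built on auto_LiRPA but gives no versions, the following sketch shows how one might bound the dynamics network's outputs over a region with auto_LiRPA (reusing the `dynamics` network and dimensions from the previous sketch). The `eps` value, the zero center, and the choice of the `"CROWN"` method are illustrative assumptions, not the paper's verification pipeline.

```python
import torch
from auto_LiRPA import BoundedModule, BoundedTensor
from auto_LiRPA.perturbations import PerturbationLpNorm

# Wrap the dynamics network for linear-relaxation-based bound propagation.
dummy_input = torch.zeros(1, STATE_DIM + ACTION_DIM)
bounded_dynamics = BoundedModule(dynamics, dummy_input)

# Bound outputs over an L-infinity ball around a grid-cell center;
# eps = 0.025 mirrors the precision limit quoted above (an assumption).
center = torch.zeros(1, STATE_DIM + ACTION_DIM)
ptb = PerturbationLpNorm(norm=float("inf"), eps=0.025)
x = BoundedTensor(center, ptb)

# Sound lower/upper bounds on the next state for every input in the ball.
lb, ub = bounded_dynamics.compute_bounds(x=(x,), method="CROWN")
```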
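Finally, Algorithm 1 is only named here, not reproduced; the sketch below is NOT the paper's algorithm but a generic skeleton of a curriculum over growing verification horizons with branch-and-bound splitting of the initial region. The box representation of regions and the `verify_region` and `finetune` interfaces are all our assumptions.

```python
import torch

def split(region):
    """Bisect an axis-aligned box (low, high) along its widest dimension."""
    low, high = region
    dim = torch.argmax(high - low)
    mid = (low[dim] + high[dim]) / 2
    left_high, right_low = high.clone(), low.clone()
    left_high[dim], right_low[dim] = mid, mid
    return [(low, left_high), (right_low, high)]

def curriculum_training(controller, s0, k_max, verify_region, finetune,
                        precision=0.025):
    """Illustrative skeleton only, NOT the paper's Algorithm 1. Grows the
    verified horizon k, branches regions that are too coarse, and fine-tunes
    the controller on regions that fail at the precision limit. The caller
    supplies `verify_region(controller, region, k) -> bool` and
    `finetune(controller, region, k) -> controller`."""
    for k in range(1, k_max + 1):
        queue, verified = [s0], []  # record subregions proven safe at this k
        while queue:
            region = queue.pop()
            low, high = region
            if verify_region(controller, region, k):
                verified.append(region)
            elif (high - low).max() > precision:
                queue.extend(split(region))  # branch and bound
            else:
                # Precision limit reached: fine-tune the controller on this
                # hard region (a real implementation would then re-verify).
                controller = finetune(controller, region, k)
    return controller, verified
```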