Dynamic Model Predictive Shielding for Provably Safe Reinforcement Learning

Authors: Arko Banerjee, Kia Rahmani, Joydeep Biswas, Isil Dillig

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that DMPS converges to policies that rarely require shield interventions after training and achieve higher rewards compared to several state-of-the-art baselines. We have implemented the DMPS algorithm in an open-source library and evaluated it on a suite of 13 representative benchmarks.
Researcher Affiliation | Collaboration | Arko Banerjee, The University of Texas at Austin, arko.banerjee@utexas.edu; Kia Rahmani, The University of Texas at Austin, kia@durable.ai; Joydeep Biswas, The University of Texas at Austin, joydeepb@utexas.edu; Isil Dillig, The University of Texas at Austin, isil@utexas.edu
Pseudocode | Yes | Algorithm 1: Reinforcement Learning with Dynamic Recovery Planning
Open Source Code | Yes | We have implemented the DMPS algorithm in an open-source library and evaluated it on a suite of 13 representative benchmarks. Our submission also includes the source code of our implementations and scripts for reproducing the results. We also plan to create an open-source repository and make our implementation publicly available.
Open Datasets | Yes | mount-car was trained for 200,000 timesteps with a maximum episode length of 999 [84], obstacle and obstacle2 were trained for 400,000 timesteps with a maximum episode length of 200 [16], and road and road2d were trained for 100,000 timesteps with a maximum episode length of 100 [16]. We used prior implementations of REVEL [16], PPO-Lag [85], and CPO [85] to run our experiments.
Dataset Splits | No | The paper does not explicitly mention a validation set or validation splits, only training and testing phases.
Hardware Specification | Yes | Our experiments were conducted on a server with 64 available Intel Xeon Gold 5218 CPUs @ 2.30GHz, 264GB of available memory, and eight NVIDIA GeForce RTX 2080 Ti GPUs.
Software Dependencies | No | We implemented DMPS by modifying the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm [15]. We used prior implementations of REVEL [16], PPO-Lag [85], and CPO [85] to run our experiments. (No specific version numbers are provided for these software components or for underlying libraries such as Python, PyTorch, or TensorFlow.)
Experiment Setup | Yes | Each of the dynamic benchmarks were trained for 200,000 timesteps with a maximum episode length of 500. The static environments were trained for the number of timesteps prescribed by the sources of the environments. Namely, mount-car was trained for 200,000 timesteps with a maximum episode length of 999 [84], obstacle and obstacle2 were trained for 400,000 timesteps with a maximum episode length of 200 [16], and road and road2d were trained for 100,000 timesteps with a maximum episode length of 100 [16]. Each experiment was run on five independent seeds. ... The negative penalty for safety violations in TD3 was taken to be large enough so that the agent could not move through obstacles and still maintain positive reward. In most cases, the penalty was simply the negation of the positive reward incurred upon successfully completing the environment. The episode did not terminate upon the first unsafe action. For CPO, we reduced tolerance for safety violations by reducing the parameter for number of acceptable violations to 1. (Illustrative sketches of the training settings and the penalty scheme follow this table.)
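For concreteness, the quoted training budgets can be collected into a small configuration table. This is only an illustrative sketch: the variable names, the dictionary layout, and the grouping into "static" versus "dynamic" settings are assumptions for readability, not names taken from the authors' released code.

```python
# Illustrative summary of the quoted training settings.
# Key names and structure are assumptions; the numbers come from the excerpts above.

# Static environments: name -> (total training timesteps, max episode length)
STATIC_BENCHMARKS = {
    "mount-car": (200_000, 999),
    "obstacle":  (400_000, 200),
    "obstacle2": (400_000, 200),
    "road":      (100_000, 100),
    "road2d":    (100_000, 100),
}

# Per the quoted setup, every dynamic benchmark shares the same budget.
DYNAMIC_BENCHMARK_SETTINGS = {"timesteps": 200_000, "max_episode_length": 500}

NUM_SEEDS = 5  # each experiment was run on five independent seeds
```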
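The penalty scheme described for the TD3 baseline (a safety violation costs the negation of the reward for completing the environment, and the episode does not terminate on the first unsafe action) can be sketched as a reward wrapper. This is a minimal sketch under assumed interfaces: the Gymnasium-style wrapper, the `completion_reward` attribute, and the `info["unsafe"]` flag are hypothetical and are not taken from the paper's implementation.

```python
import gymnasium as gym


class SafetyPenaltyWrapper(gym.Wrapper):
    """Illustrative reward shaping for a penalized TD3 baseline.

    Assumed (not from the paper's code): the wrapped environment exposes a
    `completion_reward` attribute and reports violations via info["unsafe"].
    """

    def __init__(self, env):
        super().__init__(env)
        # Penalty is the negation of the reward for successfully completing
        # the environment, as described in the quoted setup.
        self.penalty = -env.completion_reward

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if info.get("unsafe", False):
            # Apply the safety penalty, but do NOT terminate the episode:
            # the quoted setup keeps the episode running after an unsafe action.
            reward += self.penalty
        return obs, reward, terminated, truncated, info
```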