Shield Decentralization for Safe Multi-Agent Reinforcement Learning

Authors: Daniel Melcer, Christopher Amato, Stavros Tripakis

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that agents equipped with decentralized shields perform comparably to agents with centralized shields in several tasks, allowing shielding to be used in environments with decentralized training and execution for the first time. We evaluate the performance of our shield decomposition algorithm on shields from several domains. We first analyze the structure of the shield itself, ensuring that the majority of actions allowed by the centralized shield are still allowed by the decentralized shield. We then train a shielded reinforcement learning agent to solve these tasks, and measure the performance of the learned policy. We find that all agents achieve a comparable value in the environment, and that, as is to be expected from the theoretical guarantees of our framework, decentralized shielding achieves 100% safety. (See the permissiveness sketch after the table.)
Researcher Affiliation | Academia | Daniel Melcer, Northeastern University, Boston, MA 02115, melcer.d@northeastern.edu; Christopher Amato, Northeastern University, Boston, MA 02115, c.amato@northeastern.edu; Stavros Tripakis, Northeastern University, Boston, MA 02115, stavros@northeastern.edu
Pseudocode | Yes | Algorithm 1 (Step 1 of Shield Decomposition: Determining Safe Actions). Algorithm 2 (in appendix) projects the actions found by the previous step into an input-output state machine for each agent, which we call a transient-state individual shield. Lastly, Algorithm 3 (in appendix) performs some mild post-processing so that this structure conforms to the shield interface. (See the shield-interface sketch after the table.)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We attach our code as supplemental material.
Open Datasets | Yes | We adapt the gridworld maps from Melo and Veloso [16] to test our method.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions training for a certain number of timesteps and then evaluating over 'testing episodes', but this is not a formal data split; the reinforcement learning setting here involves online interaction rather than fixed datasets.
Hardware Specification | Yes | Centralized shield synthesis took approximately five minutes on an M1 MacBook Pro, and our decentralization algorithm ran in under 30 seconds. Each individual run took approximately 5 minutes on a single core of an Intel server CPU; we trained 240 agents in total. Each run took approximately 8 hours using 3 threads on a server CPU.
Software Dependencies | No | The paper mentions methods such as Q-learning, DQN, and neural networks, but does not specify version numbers for any software libraries or frameworks (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We trained independent tabular Q-learning agents using ε-greedy exploration, with a linear annealing schedule from 1 to 0.05. The discount factor is 0.9. Agents were trained with a centralized shield, a decentralized shield, and with no shield. When an agent attempts to take an action a which is not allowed by the shield, the penalty reward for the synthetic transition is r_p = 10. Full hyperparameters for this agent are located in the appendix. (See the Q-learning sketch after the table.)
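The Research Type entry above mentions checking that most actions allowed by the centralized shield are still allowed after decentralization. The helper below is a minimal sketch of how such a permissiveness ratio could be computed; the data structures, the `permissiveness` name, and the assumption that both shields are indexed by the same abstract state are ours, not the paper's.

```python
from itertools import product

def permissiveness(centralized, decentralized, states):
    """Fraction of (state, joint action) pairs allowed by the centralized
    shield that are still allowed after decentralization.

    centralized:   dict state -> set of allowed joint actions (tuples)
    decentralized: list of per-agent dicts, state -> set of allowed actions
    """
    kept = total = 0
    for s in states:
        allowed_joint = centralized[s]
        # A joint action survives decentralization iff every agent's shield
        # allows its own component of that joint action.
        surviving = set(product(*(agent[s] for agent in decentralized)))
        total += len(allowed_joint)
        kept += len(allowed_joint & surviving)
    return kept / total if total else 1.0
```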
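The Pseudocode entry describes the per-agent shields produced by Algorithms 2 and 3 as input-output state machines conforming to a shield interface. The class below is a hedged sketch of what such an interface might look like; the field names, the `allowed`/`step` methods, and the assumption that agents only commit to allowed actions are illustrative assumptions, not the paper's algorithms.

```python
from dataclasses import dataclass

@dataclass
class IndividualShield:
    """Hypothetical per-agent shield as an input-output state machine:
    each shield state exposes a set of allowed actions, and the machine
    transitions when the agent commits to one of them."""
    allowed_actions: dict   # shield_state -> set of allowed agent actions
    transitions: dict       # (shield_state, action) -> next shield_state
    state: int = 0          # current shield state

    def allowed(self):
        """Actions this agent may take in the current shield state."""
        return self.allowed_actions[self.state]

    def step(self, action):
        """Advance the shield after the agent takes `action` (assumed to be
        allowed; per the paper, Algorithm 3's post-processing is what makes
        the produced structure conform to the shield interface)."""
        self.state = self.transitions[(self.state, action)]
```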
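The Experiment Setup entry quotes the key hyperparameters (γ = 0.9, ε annealed linearly from 1 to 0.05, penalty reward r_p = 10 for shield-blocked actions). The snippet below sketches how those numbers could enter the action-selection step of shielded tabular Q-learning; the learning rate, annealing horizon, the negative sign on the penalty, the self-loop treatment of the synthetic transition, and the fallback to a random allowed action are assumptions for this sketch, with the full hyperparameters in the paper's appendix.

```python
import numpy as np

GAMMA = 0.9                       # discount factor (from the paper)
EPS_START, EPS_END = 1.0, 0.05    # linear epsilon annealing (from the paper)
R_PENALTY = 10.0                  # penalty reward r_p (from the paper)
ALPHA = 0.1                       # learning rate -- assumed, not in the excerpt
TOTAL_STEPS = 100_000             # annealing horizon -- assumed

def epsilon(t):
    """Linearly anneal epsilon from EPS_START to EPS_END over TOTAL_STEPS."""
    frac = min(t / TOTAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def shielded_action(Q, s, allowed, t, n_actions, rng):
    """Epsilon-greedy proposal, filtered by the agent's shield."""
    if rng.random() < epsilon(t):
        proposed = int(rng.integers(n_actions))
    else:
        proposed = int(np.argmax(Q[s]))
    if proposed not in allowed:
        # Synthetic transition: penalize the blocked action without executing
        # it, treating the transition as a self-loop and the penalty as a
        # negative reward (both assumptions of this sketch).
        Q[s][proposed] += ALPHA * (-R_PENALTY + GAMMA * np.max(Q[s]) - Q[s][proposed])
        proposed = int(rng.choice(sorted(allowed)))  # fall back to a safe action
    return proposed
```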