Shield Decentralization for Safe Multi-Agent Reinforcement Learning
Authors: Daniel Melcer, Christopher Amato, Stavros Tripakis
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that agents equipped with decentralized shields perform comparably to agents with centralized shields in several tasks, allowing shielding to be used in environments with decentralized training and execution for the first time. We evaluate the performance of our shield decomposition algorithm on shields from several domains. We first analyze the structure of the shield itself, ensuring that the majority of actions allowed by the centralized shield are still allowed by the decentralized shield. We then train a shielded reinforcement learning agent to solve these tasks, and measure the performance of the learned policy. We find that all agents achieve a comparable value in the environment, and that, as is to be expected from the theoretical guarantees of our framework, decentralized shielding achieves 100% safety. |
| Researcher Affiliation | Academia | Daniel Melcer, Northeastern University, Boston, MA 02115, melcer.d@northeastern.edu; Christopher Amato, Northeastern University, Boston, MA 02115, c.amato@northeastern.edu; Stavros Tripakis, Northeastern University, Boston, MA 02115, stavros@northeastern.edu |
| Pseudocode | Yes | Algorithm 1 (Step 1 of Shield Decomposition: Determining Safe Actions). Algorithm 2 (in appendix) projects the actions found by the previous step into an input-output state machine for each agent, which we call a transient-state individual shield. Lastly, Algorithm 3 (in appendix) performs some mild post-processing so that this structure conforms to the shield interface. A simplified sketch of the shield interface and the coverage check it enables appears after this table. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We attach our code as supplemental material. |
| Open Datasets | Yes | We adapt the gridworld maps from Melo and Veloso [16] to test our method. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions training for a certain number of timesteps and then evaluating over 'testing episodes', but does not describe a formal data split; the reinforcement learning setting involves online interaction rather than fixed datasets. |
| Hardware Specification | Yes | Centralized shield synthesis took approximately five minutes on an M1 MacBook Pro, and our decentralization algorithm ran in under 30 seconds. Each individual run took approximately 5 minutes on a single core of an Intel server CPU; we trained 240 agents in total. Each run took approximately 8 hours using 3 threads on a server CPU. |
| Software Dependencies | No | The paper mentions methods such as Q-learning, DQN, and neural networks, but does not specify version numbers for any libraries or frameworks (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | We trained independent tabular Q-learning agents using ε-greedy exploration, with a linear annealing schedule from 1 to 0.05. The discount factor is 0.9. Agents were trained with a centralized shield, a decentralized shield, and with no shield. When an agent attempts to take an action a which is not allowed by the shield, the penalty reward for the synthetic transition is r_p = 10. Full hyperparameters for this agent are located in the appendix. A simplified training-loop sketch illustrating this setup appears after this table. |
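The "Pseudocode" row refers to a three-step decomposition (Algorithms 1-3 of the paper), which we do not reimplement here. The sketch below only illustrates, under assumed data structures (a centralized shield as a per-state set of allowed joint actions, and a decentralized shield as per-state, per-agent sets of allowed individual actions), the two structural checks described in the "Research Type" row: that the decentralized shield never allows a joint action the centralized shield forbids, and what fraction of centrally allowed joint actions remain allowed after decomposition.

```python
# Minimal, illustrative sketch (not the authors' released code) of the
# structural analysis: safety and coverage of a decentralized shield
# relative to a centralized one. All names and data structures here are
# assumptions made for illustration.
from itertools import product

def analyze_decomposition(central_allowed, individual_allowed):
    """central_allowed: dict state -> set of allowed joint actions (tuples).
    individual_allowed: dict state -> list (one per agent) of allowed action sets.
    Returns (is_safe, coverage)."""
    retained, total = 0, 0
    is_safe = True
    for state, joint_set in central_allowed.items():
        per_agent = individual_allowed[state]
        # Joint actions permitted by the decentralized shield: the product of
        # each agent's individually allowed actions.
        decentralized_joint = set(product(*per_agent))
        # Safety: the decentralized shield must not allow anything the
        # centralized shield forbids.
        if not decentralized_joint <= joint_set:
            is_safe = False
        # Coverage: fraction of centrally allowed joint actions still allowed.
        retained += len(joint_set & decentralized_joint)
        total += len(joint_set)
    return is_safe, retained / total if total else 1.0

# Toy example with two agents and actions {0, 1}.
central = {"s0": {(0, 0), (0, 1), (1, 0)}}         # (1, 1) is unsafe
individual = {"s0": [{0}, {0, 1}]}                 # agent 0 restricted to 0
print(analyze_decomposition(central, individual))  # (True, 0.666...)
```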
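Similarly, the "Experiment Setup" row describes independent tabular Q-learning with ε-greedy exploration and a penalty applied on a synthetic transition when a shield-blocked action is attempted. The sketch below shows one way such a loop could look; the shield interface (`shield.allowed_actions`), the environment API, the learning rate, and the sign applied to the quoted penalty magnitude of 10 are assumptions, not details taken from the paper's supplemental code.

```python
# Hedged sketch of the quoted setup: tabular Q-learning, epsilon-greedy
# exploration annealed linearly from 1.0 to 0.05, discount 0.9, and a
# penalty when the shield blocks the attempted action.
import random
from collections import defaultdict

GAMMA = 0.9             # discount factor (from the report)
ALPHA = 0.1             # learning rate (assumed; not stated in this excerpt)
SHIELD_PENALTY = -10.0  # magnitude 10 quoted as r_p; negative sign assumed
EPS_START, EPS_END = 1.0, 0.05

def train(env, shield, n_actions, total_steps):
    Q = defaultdict(lambda: [0.0] * n_actions)
    state = env.reset()
    for step in range(total_steps):
        # Linear annealing of the exploration rate.
        eps = EPS_START + (EPS_END - EPS_START) * step / total_steps
        if random.random() < eps:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        allowed = shield.allowed_actions(state)  # assumed shield interface
        if action not in allowed:
            # Synthetic transition: penalize the blocked attempt without
            # advancing the environment, then fall back to an allowed action.
            Q[state][action] += ALPHA * (
                SHIELD_PENALTY + GAMMA * max(Q[state]) - Q[state][action])
            action = random.choice(sorted(allowed))
        next_state, reward, done = env.step(action)
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = env.reset() if done else next_state
    return Q
```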