Shield Decentralization for Safe Multi-Agent Reinforcement Learning

Authors: Daniel Melcer, Christopher Amato, Stavros Tripakis

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that agents equipped with decentralized shields perform comparably to agents with centralized shields in several tasks, allowing shielding to be used in environments with decentralized training and execution for the first time. We evaluate the performance of our shield decomposition algorithm on shields from several domains. We first analyze the structure of the shield itself, ensuring that the majority of actions allowed by the centralized shield are still allowed by the decentralized shield. We then train a shielded reinforcement learning agent to solve these tasks, and measure the performance of the learned policy. We find that all agents achieve a comparable value in the environment, and that, as is to be expected from the theoretical guarantees of our framework, decentralized shielding achieves 100% safety. (See the permissiveness sketch after the table.)
Researcher Affiliation | Academia | Daniel Melcer, Northeastern University, Boston, MA 02115, melcer.d@northeastern.edu; Christopher Amato, Northeastern University, Boston, MA 02115, c.amato@northeastern.edu; Stavros Tripakis, Northeastern University, Boston, MA 02115, stavros@northeastern.edu
Pseudocode | Yes | Algorithm 1 (Step 1 of Shield Decomposition: Determining Safe Actions). Algorithm 2 (in appendix) projects the actions found by the previous step into an input-output state machine for each agent, which we call a transient-state individual shield. Lastly, Algorithm 3 (in appendix) performs some mild post-processing so that this structure conforms to the shield interface. (See the shield-interface sketch after the table.)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We attach our code as supplemental material.
Open Datasets | Yes | We adapt the gridworld maps from Melo and Veloso [16] to test our method.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions training for a certain number of timesteps and then evaluating over 'testing episodes', but this is not a formal data split; the reinforcement learning setting here involves online interaction rather than fixed datasets.
Hardware Specification | Yes | Centralized shield synthesis took approximately five minutes on an M1 MacBook Pro, and our decentralization algorithm ran in under 30 seconds. Each individual run took approximately 5 minutes on a single core of an Intel server CPU; we trained 240 agents in total. Each run took approximately 8 hours using 3 threads on a server CPU.
Software Dependencies | No | The paper mentions methods such as Q-learning, DQN, and neural networks, but does not specify version numbers for any software libraries or frameworks (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We trained independent tabular Q-learning agents using ε-greedy exploration, with a linear annealing schedule from 1 to 0.05. The discount factor is 0.9. Agents were trained with a centralized shield, a decentralized shield, and with no shield. When an agent attempts to take an action a which is not allowed by the shield, the penalty reward for the synthetic transition is r_p = 10. Full hyperparameters for this agent are located in the appendix. (See the Q-learning sketch after the table.)
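The Research Type entry above mentions checking that most actions allowed by the centralized shield are still allowed after decentralization. The helper below is a minimal sketch of how such a permissiveness ratio could be computed; the data structures, the `permissiveness` name, and the assumption that both shields are indexed by the same abstract state are ours, not the paper's.

```python
from itertools import product

def permissiveness(centralized, decentralized, states):
    """Fraction of (state, joint action) pairs allowed by the centralized
    shield that are still allowed after decentralization.

    centralized:   dict state -> set of allowed joint actions (tuples)
    decentralized: list of per-agent dicts, state -> set of allowed actions
    """
    kept = total = 0
    for s in states:
        allowed_joint = centralized[s]
        # A joint action survives decentralization iff every agent's shield
        # allows its own component of that joint action.
        surviving = set(product(*(agent[s] for agent in decentralized)))
        total += len(allowed_joint)
        kept += len(allowed_joint & surviving)
    return kept / total if total else 1.0
```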
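The Pseudocode entry describes the per-agent shields produced by Algorithms 2 and 3 as input-output state machines conforming to a shield interface. The class below is a hedged sketch of what such an interface might look like; the field names, the `allowed`/`step` methods, and the assumption that agents only commit to allowed actions are illustrative assumptions, not the paper's algorithms.

```python
from dataclasses import dataclass

@dataclass
class IndividualShield:
    """Hypothetical per-agent shield as an input-output state machine:
    each shield state exposes a set of allowed actions, and the machine
    transitions when the agent commits to one of them."""
    allowed_actions: dict   # shield_state -> set of allowed agent actions
    transitions: dict       # (shield_state, action) -> next shield_state
    state: int = 0          # current shield state

    def allowed(self):
        """Actions this agent may take in the current shield state."""
        return self.allowed_actions[self.state]

    def step(self, action):
        """Advance the shield after the agent takes `action` (assumed to be
        allowed; per the paper, Algorithm 3's post-processing is what makes
        the produced structure conform to the shield interface)."""
        self.state = self.transitions[(self.state, action)]
```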
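The Experiment Setup entry quotes the key hyperparameters (γ = 0.9, ε annealed linearly from 1 to 0.05, penalty reward r_p = 10 for shield-blocked actions). The snippet below sketches how those numbers could enter the action-selection step of shielded tabular Q-learning; the learning rate, annealing horizon, the negative sign on the penalty, the self-loop treatment of the synthetic transition, and the fallback to a random allowed action are assumptions for this sketch, with the full hyperparameters in the paper's appendix.

```python
import numpy as np

GAMMA = 0.9                       # discount factor (from the paper)
EPS_START, EPS_END = 1.0, 0.05    # linear epsilon annealing (from the paper)
R_PENALTY = 10.0                  # penalty reward r_p (from the paper)
ALPHA = 0.1                       # learning rate -- assumed, not in the excerpt
TOTAL_STEPS = 100_000             # annealing horizon -- assumed

def epsilon(t):
    """Linearly anneal epsilon from EPS_START to EPS_END over TOTAL_STEPS."""
    frac = min(t / TOTAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def shielded_action(Q, s, allowed, t, n_actions, rng):
    """Epsilon-greedy proposal, filtered by the agent's shield."""
    if rng.random() < epsilon(t):
        proposed = int(rng.integers(n_actions))
    else:
        proposed = int(np.argmax(Q[s]))
    if proposed not in allowed:
        # Synthetic transition: penalize the blocked action without executing
        # it, treating the transition as a self-loop and the penalty as a
        # negative reward (both assumptions of this sketch).
        Q[s][proposed] += ALPHA * (-R_PENALTY + GAMMA * np.max(Q[s]) - Q[s][proposed])
        proposed = int(rng.choice(sorted(allowed)))  # fall back to a safe action
    return proposed
```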