Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance

Authors: Josh McClellan, Naveed Haghani, John Winder, Furong Huang, Pratap Tokekar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks (E2GN2). We compare E2GN2 to other common function approximators on the common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems.
Researcher Affiliation | Collaboration | Joshua McClellan (JHU APL, University of Maryland) joshmccl@umd.edu; Naveed Haghani (JHU APL) naveed.haghani@jhuapl.edu; John Winder (JHU APL) john.winder@jhuapl.edu; Furong Huang (University of Maryland) furongh@umd.edu; Pratap Tokekar (University of Maryland) tokekar@umd.edu
Pseudocode | No | The paper presents mathematical equations for the EGNN and E2GN2 layers but does not include any explicitly labeled pseudocode or algorithm blocks. (See the EGNN layer sketch below the table.)
Open Source Code | No | Much of the necessary code is available online. Part of this work was developed for a private company, and permission for the full code release was not given.
Open Datasets | Yes | To address these questions, we use common MARL benchmarks: the multi-agent particle-world environment (MPE) [13] and the StarCraft Multi-Agent Challenge version 2 (SMACv2) [14]. (See the environment sketch below the table.)
Dataset Splits | No | The paper specifies training parameters like 'train batch size' and 'mini-batch size', but it does not explicitly state a validation dataset split or a strategy for hyperparameter tuning on a validation set.
Hardware Specification | Yes | For training hardware, we trained the graph-structured networks using various GPUs. Many were trained on an A100, but that is certainly not necessary; they didn't need that much space. MLPs were trained on CPUs. We typically used 4 rollout workers across 4 CPU threads, so each training run used 5 CPU threads.
Software Dependencies | No | The paper mentions using "RLlib [27]" as a training library, but it does not specify a version number for RLlib or any other critical software dependencies.
Experiment Setup | Yes | Further hyperparameter details are found in Appendix B. Table 3 (PPO hyperparameters for SMACv2): train batch size 8000, mini-batch size 2000, PPO clip 0.1, learning rate 25e-5, num SGD iterations 15, gamma 0.99, lambda 0.95. Table 4 (PPO hyperparameters for MPE): train batch size 2000, mini-batch size 1000, PPO clip 0.2, learning rate 30e-5, num SGD iterations 10, gamma 0.99, lambda 0.95. (See the configuration sketch below the table.)
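For reference on the Pseudocode row: the paper builds on equivariant graph neural networks, whose generic layer update (following Satorras et al., 2021) is sketched below. This is the standard EGNN formulation, not the paper's exact E2GN2 modification; the functions phi_e, phi_x, phi_h are the usual learned MLPs.

```latex
% Standard EGNN layer update (Satorras et al., 2021); a reference sketch,
% not the paper's E2GN2 equations.
\begin{aligned}
m_{ij}    &= \phi_e\left(h_i^{l}, h_j^{l}, \lVert x_i^{l} - x_j^{l} \rVert^2, a_{ij}\right) \\
x_i^{l+1} &= x_i^{l} + C \sum_{j \neq i} \left(x_i^{l} - x_j^{l}\right) \phi_x\!\left(m_{ij}\right) \\
h_i^{l+1} &= \phi_h\Big(h_i^{l}, \sum_{j \neq i} m_{ij}\Big)
\end{aligned}
```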
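On the Open Datasets row: MPE scenarios are typically instantiated through the PettingZoo wrappers today. The paper does not state which wrapper, scenario set, or version it used, so the snippet below is only an illustrative sketch using the publicly available simple_spread_v3 scenario.

```python
# Illustrative sketch: running a random policy in an MPE scenario via PettingZoo.
# The paper cites the original MPE [13]; the wrapper and scenario shown here are assumptions.
from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.parallel_env(N=3, max_cycles=25, continuous_actions=False)
observations, infos = env.reset(seed=0)

while env.agents:
    # Placeholder random actions; in the paper's experiments these would come
    # from the trained GNN / EGNN / E2GN2 policy.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```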
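On the Experiment Setup row: the reported hyperparameters map naturally onto RLlib's PPO configuration. Since the exact RLlib version and training script are not public, the dictionaries below are a hedged sketch using RLlib's legacy config-dict key names.

```python
# Hedged sketch: Tables 3 and 4 hyperparameters expressed as RLlib-style PPO config
# dicts (key names follow RLlib's legacy config convention; the paper's exact setup
# and RLlib version are not public).
smacv2_ppo_config = {
    "train_batch_size": 8000,
    "sgd_minibatch_size": 2000,
    "clip_param": 0.1,       # PPO clip
    "lr": 25e-5,
    "num_sgd_iter": 15,
    "gamma": 0.99,
    "lambda": 0.95,
    "num_workers": 4,        # 4 rollout workers, per the hardware notes
}

mpe_ppo_config = {
    "train_batch_size": 2000,
    "sgd_minibatch_size": 1000,
    "clip_param": 0.2,
    "lr": 30e-5,
    "num_sgd_iter": 10,
    "gamma": 0.99,
    "lambda": 0.95,
    "num_workers": 4,
}
```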