Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance
Authors: Josh McClellan, Naveed Haghani, John Winder, Furong Huang, Pratap Tokekar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate that EGNNs improve sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks (E2GN2). We compare E2GN2 to other common function approximators on the common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems. |
| Researcher Affiliation | Collaboration | Joshua McClellan, JHU APL / University of Maryland, joshmccl@umd.edu; Naveed Haghani, JHU APL, naveed.haghani@jhuapl.edu; John Winder, JHU APL, john.winder@jhuapl.edu; Furong Huang, University of Maryland, furongh@umd.edu; Pratap Tokekar, University of Maryland, tokekar@umd.edu |
| Pseudocode | No | The paper presents mathematical equations for the EGNN and E2GN2 layers but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Much of the necessary code is available online. Part of this work was developed for a private company, and permission for the full code release was not given. |
| Open Datasets | Yes | To address these questions, we use common MARL benchmarks: the multi-agent particle-world environment (MPE) [13] and Starcraft Multi-agent Challenge version 2 (SMACv2) [14]. |
| Dataset Splits | No | The paper specifies training parameters like 'train batch size' and 'mini-batch size', but it does not explicitly state a validation dataset split or strategy for hyperparameter tuning on a validation set. |
| Hardware Specification | Yes | For training hardware, we trained the graph-structured networks using various GPUs. Many were trained on an A100, but that is certainly not necessary; they didn't need that much space. MLPs were trained on CPUs. We typically used 4 rollout workers across 4 CPU threads, so each training run used 5 CPU threads. |
| Software Dependencies | No | The paper mentions using "RLlib [27]" as a training library, but it does not specify a version number for RLlib or any other critical software dependencies. |
| Experiment Setup | Yes | Further hyperparameter details are found in Appendix B. Table 3: PPO hyperparameters for SMACv2 [train batch size 8000, mini-batch size 2000, PPO clip 0.1, learning rate 25e-5, num SGD iterations 15, gamma 0.99, lambda 0.95]. Table 4: PPO hyperparameters for MPE [train batch size 2000, mini-batch size 1000, PPO clip 0.2, learning rate 30e-5, num SGD iterations 10, gamma 0.99, lambda 0.95]. |
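For quick reference, the quoted PPO hyperparameters from the paper's Appendix B can be collected into plain Python dicts. This is only a convenience sketch: the key names are illustrative and are not claimed to match RLlib's exact config field names.

```python
# PPO hyperparameters quoted from the paper's Appendix B (Tables 3 and 4).
# Key names are illustrative, not RLlib's exact config fields.

SMACV2_PPO = {
    "train_batch_size": 8000,
    "minibatch_size": 2000,
    "ppo_clip": 0.1,
    "learning_rate": 25e-5,   # i.e. 2.5e-4
    "num_sgd_iterations": 15,
    "gamma": 0.99,
    "lambda": 0.95,
}

MPE_PPO = {
    "train_batch_size": 2000,
    "minibatch_size": 1000,
    "ppo_clip": 0.2,
    "learning_rate": 30e-5,   # i.e. 3.0e-4
    "num_sgd_iterations": 10,
    "gamma": 0.99,
    "lambda": 0.95,
}
```

When reproducing the experiments with RLlib (the training library the paper cites), these values would need to be mapped onto the corresponding fields of the chosen RLlib version's PPO configuration.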