RPM: Generalizable Multi-Agent Policies for Multi-Agent Reinforcement Learning

Authors: Wei Qiu, Xiao Ma, Bo An, Svetlana Obraztsova, Shuicheng Yan, Zhongwen Xu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on Melting Pot demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and to complete the given tasks, boosting performance by up to 818% on average.
Researcher Affiliation | Collaboration | Nanyang Technological University; Sea AI Lab (contact: qiuw0008@e.ntu.edu.sg, zhongwen.s.xu@gmail.com)
Pseudocode | Yes | Algorithm 1: MARL with RPM (an illustrative sketch of the training loop is given after the table).
Open Source Code | Yes | Our code, pictorial examples, videos and experimental results are available at this link: https://sites.google.com/view/rpm-iclr2023/
Open Datasets | Yes | We conduct large-scale experiments with Melting Pot (Leibo et al., 2021), a well-recognized benchmark for MARL generalization evaluation.
Dataset Splits | No | The paper uses the Melting Pot benchmark, where agents are trained in 'substrates' and evaluated in held-out 'scenarios' with unseen co-players. This is an environment-level split for generalization evaluation, not an explicit numerical or percentage split of a static dataset into training, validation, and test sets.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies | No | We implement our actors with Ray (Moritz et al., 2018) and the learner with EPyMARL (Papoudakis et al., 2021). Specific version numbers for these software dependencies are not provided. (A sketch of this actor/learner split appears after the table.)
Experiment Setup | Yes | We use the default training parameters from MARL-Algorithms. The learning rate for both the actor and critic networks is 5e-5. The PPO clip parameter is 0.2. The discount factor γ is 0.99. The Generalized Advantage Estimation (GAE) λ is 0.95. The entropy coefficient is 0.01. The number of epochs for policy optimization is 15. The batch size is 256. The number of minibatches is 2. The observation and action spaces are discrete. We train agents for 200 million frames. (These values are collected into a configuration sketch after the table.)
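
To make the "Algorithm 1: MARL with RPM" row concrete, here is a minimal sketch of the ranked-policy-memory idea: checkpoints are bucketed by training episode return, and partner policies for the next episode are drawn from randomly chosen ranks to diversify training interactions. This is not the authors' implementation; RankedPolicyMemory, collect_episode, ppo_update, and the env/policy objects are hypothetical placeholders, and details such as rank width and save frequency are assumptions.

```python
# Hedged sketch of MARL training with a ranked policy memory (RPM).
import copy
import random
from collections import defaultdict


class RankedPolicyMemory:
    def __init__(self, rank_width=10.0):
        self.rank_width = rank_width      # return interval covered by one rank bucket (assumed)
        self.memory = defaultdict(list)   # rank id -> list of saved policy checkpoints

    def save(self, episode_return, policy):
        rank = int(episode_return // self.rank_width)
        self.memory[rank].append(copy.deepcopy(policy))

    def sample(self):
        # Pick a random rank first, then a random checkpoint within that rank,
        # so low- and high-performing behaviors are both represented.
        rank = random.choice(list(self.memory.keys()))
        return random.choice(self.memory[rank])


def train_with_rpm(env, learner_policy, n_agents, n_iterations,
                   collect_episode, ppo_update):
    """collect_episode and ppo_update are hypothetical helpers:
    collect_episode(env, policies, learner_slot) -> (learner_trajectory, episode_return)
    ppo_update(policy, trajectory) -> None
    """
    rpm = RankedPolicyMemory()
    for _ in range(n_iterations):
        # One agent slot is controlled by the learner; the others are filled
        # with partners sampled from RPM (or the learner itself while memory is empty).
        learner_slot = random.randrange(n_agents)
        policies = [
            learner_policy if i == learner_slot
            else (rpm.sample() if rpm.memory else learner_policy)
            for i in range(n_agents)
        ]

        trajectory, episode_return = collect_episode(env, policies, learner_slot)

        # Store the current learner checkpoint under its return rank,
        # then update only the learner's parameters with the MARL objective (PPO here).
        rpm.save(episode_return, learner_policy)
        ppo_update(learner_policy, trajectory)
    return learner_policy
```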
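The "Software Dependencies" row mentions Ray-based actors feeding a central learner. The sketch below shows one way such an actor/learner split can look; only the Ray calls (ray.init, @ray.remote, .remote(), ray.get) are real API, while EnvWorker, make_env, and the gym-style environment interface are assumed placeholders rather than the paper's code.

```python
# Minimal actor/learner sketch with Ray remote workers (illustrative only).
import ray

ray.init()


@ray.remote
class EnvWorker:
    def __init__(self, make_env):
        self.env = make_env()  # make_env is a hypothetical environment factory

    def rollout(self, policy_weights, horizon=128):
        # Collect `horizon` transitions; the random action is a stand-in for
        # evaluating a policy loaded from `policy_weights`.
        transitions = []
        obs = self.env.reset()
        for _ in range(horizon):
            action = self.env.action_space.sample()
            obs, reward, done, info = self.env.step(action)
            transitions.append((obs, action, reward, done))
            if done:
                obs = self.env.reset()
        return transitions


# Usage sketch: spawn several workers, gather rollouts in parallel, then hand
# the combined batch to an EPyMARL-style learner for a PPO update.
# workers = [EnvWorker.remote(make_env) for _ in range(8)]
# batches = ray.get([w.rollout.remote(current_weights) for w in workers])
```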
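Finally, the hyperparameters quoted under "Experiment Setup" are collected below into a single configuration dictionary for readability. The values are taken directly from the quoted text; the key names are illustrative and may differ from the authors' configuration files.

```python
# Training hyperparameters reported in the paper (key names are assumptions).
RPM_PPO_CONFIG = {
    "actor_lr": 5e-5,                 # learning rate, actor network
    "critic_lr": 5e-5,                # learning rate, critic network
    "ppo_clip": 0.2,                  # PPO clipping parameter
    "gamma": 0.99,                    # discount factor
    "gae_lambda": 0.95,               # Generalized Advantage Estimation lambda
    "entropy_coef": 0.01,             # entropy bonus coefficient
    "ppo_epochs": 15,                 # optimization epochs per batch
    "batch_size": 256,
    "num_minibatches": 2,
    "total_env_frames": 200_000_000,  # 200 million training frames
}
```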