RPM: Generalizable Multi-Agent Policies for Multi-Agent Reinforcement Learning
Authors: Wei Qiu, Xiao Ma, Bo An, Svetlana Obraztsova, Shuicheng Yan, Zhongwen Xu
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on Melting Pot demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete the given tasks. It significantly boosts performance by up to 818% on average. |
| Researcher Affiliation | Collaboration | Nanyang Technological University; Sea AI Lab; qiuw0008@e.ntu.edu.sg; zhongwen.s.xu@gmail.com |
| Pseudocode | Yes | Algorithm 1: MARL with RPM (a hedged sketch of a ranked-policy-memory training loop is given after the table) |
| Open Source Code | Yes | Our code, pictorial examples, videos and experimental results are available at this link: https://sites.google.com/view/rpm-iclr2023/. |
| Open Datasets | Yes | We conduct large-scale experiments with the Melting Pot (Leibo et al., 2021), which is a well-recognized benchmark for MARL generalization evaluation. |
| Dataset Splits | No | The paper uses the Melting Pot benchmark, where agents are trained in 'substrates' and evaluated in 'scenarios'. This represents an environment split for generalization evaluation, not explicit numerical or percentage splits of a static dataset into training, validation, and test sets. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | We implement our actors with Ray (Moritz et al., 2018) and the learner with EPyMARL (Papoudakis et al., 2021). Specific version numbers for these software dependencies are not provided. A hypothetical Ray actor sketch is given after the table. |
| Experiment Setup | Yes | We use the default training parameters from MARL-Algorithms. The learning rate for both the actor and critic networks is 5e-5. The clip parameter for PPO is 0.2. The discount factor γ is 0.99. The Generalized Advantage Estimation (GAE) λ is 0.95. The entropy coefficient is 0.01. The number of epochs for policy optimization is 15. The batch size is 256. The number of minibatches is 2. The observation and action spaces are discrete. We train agents for 200 million frames. These values are collected in the configuration sketch after the table. |
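
As a point of reference for the Pseudocode row, the following is a minimal sketch of a ranked policy memory and the surrounding training loop, assuming that policy checkpoints are bucketed by a discretised training-episode return and that the other agents' behavior policies are occasionally sampled from the memory. `RankedPolicyMemory`, `rank_resolution`, `sample_prob`, `rollout_fn`, and `learner` are this report's own placeholder names, not the authors' implementation of Algorithm 1.

```python
import random
from collections import defaultdict


class RankedPolicyMemory:
    """Sketch of a ranked policy memory: checkpoints are bucketed by a
    discretised training-episode return ("rank"). Hypothetical, not the
    authors' code."""

    def __init__(self, rank_resolution=10.0):
        self.rank_resolution = rank_resolution
        self.buckets = defaultdict(list)  # rank -> list of policy checkpoints

    def save(self, episode_return, policy_params):
        rank = int(episode_return // self.rank_resolution)
        self.buckets[rank].append(policy_params)

    def sample(self):
        # Pick a rank uniformly first, then a checkpoint within that rank,
        # so both low- and high-return behaviors are represented.
        rank = random.choice(list(self.buckets.keys()))
        return random.choice(self.buckets[rank])


def train_with_rpm(learner, rollout_fn, num_agents, num_episodes,
                   sample_prob=0.5):
    """Training-loop sketch: with probability `sample_prob`, the other agents
    act with policies drawn from the memory so the learner faces diverse
    partners/opponents. `learner` and `rollout_fn` are user-supplied stubs."""
    rpm = RankedPolicyMemory()
    for _ in range(num_episodes):
        if rpm.buckets and random.random() < sample_prob:
            others = [rpm.sample() for _ in range(num_agents - 1)]
        else:
            others = [learner.params] * (num_agents - 1)
        episode_return, trajectories = rollout_fn(learner.params, others)
        learner.update(trajectories)              # e.g. a PPO-style update
        rpm.save(episode_return, learner.params)  # grow the ranked memory
    return rpm
```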
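The Software Dependencies row notes that the actors are implemented with Ray and the learner with EPyMARL, without version numbers. Below is a rough, hypothetical illustration of the actor side of such an actor–learner setup using Ray's standard remote-actor API; `make_env`, `RolloutActor`, and the returned trajectory format are placeholders, not the paper's code.

```python
import ray


def make_env():
    """Stand-in for an environment builder (e.g. a Melting Pot substrate);
    hypothetical placeholder."""
    return None


@ray.remote
class RolloutActor:
    """Each remote actor owns one environment copy and collects episodes
    with parameters broadcast from a central learner."""

    def __init__(self, env_builder):
        self.env = env_builder()

    def rollout(self, policy_params):
        # Placeholder: a real actor would step the environment with the
        # given parameters and return a trajectory of transitions.
        return {"params": policy_params, "transitions": []}


if __name__ == "__main__":
    ray.init()
    actors = [RolloutActor.remote(make_env) for _ in range(4)]
    # Broadcast the current parameters and gather trajectories in parallel.
    trajectories = ray.get([a.rollout.remote("theta_t") for a in actors])
```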
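The Experiment Setup row lists the reported hyperparameters. The dataclass below simply collects those values in one place; the class and field names are illustrative, only the numbers come from the row above, and `actor_lr`/`critic_lr` both repeat the shared 5e-5 learning rate.

```python
from dataclasses import dataclass


@dataclass
class ReportedPPOConfig:
    """Hyperparameters quoted in the Experiment Setup row; field names are
    illustrative, not taken from the authors' code."""
    actor_lr: float = 5e-5           # learning rate, actor network
    critic_lr: float = 5e-5          # learning rate, critic network
    clip_param: float = 0.2          # PPO clipping parameter
    gamma: float = 0.99              # discount factor
    gae_lambda: float = 0.95         # GAE lambda
    entropy_coef: float = 0.01       # entropy bonus coefficient
    ppo_epochs: int = 15             # optimization epochs per update
    batch_size: int = 256
    num_minibatches: int = 2
    total_frames: int = 200_000_000  # training budget (200 million frames)
```

For instance, `cfg = ReportedPPOConfig()` would reproduce the quoted settings as defaults, with any individual value overridable by keyword argument.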