RPM: Generalizable Multi-Agent Policies for Multi-Agent Reinforcement Learning
Authors: Wei Qiu, Xiao Ma, Bo An, Svetlana Obraztsova, Shuicheng Yan, Zhongwen Xu
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on Melting Pot demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete the given tasks. It significantly boosts performance by up to 818% on average. |
| Researcher Affiliation | Collaboration | Nanyang Technological University; Sea AI Lab; qiuw0008@e.ntu.edu.sg; zhongwen.s.xu@gmail.com |
| Pseudocode | Yes | Algorithm 1: MARL with RPM (a hedged sketch of a ranked-policy-memory training loop is given after the table) |
| Open Source Code | Yes | Our code, pictorial examples, videos and experimental results are available at this link: https://sites.google.com/view/rpm-iclr2023/. |
| Open Datasets | Yes | We conduct large-scale experiments with the Melting Pot (Leibo et al., 2021), which is a well-recognized benchmark for MARL generalization evaluation. |
| Dataset Splits | No | The paper uses the Melting Pot benchmark, where agents are trained in 'substrates' and evaluated in 'scenarios'. This represents an environment split for generalization evaluation, not explicit numerical or percentage splits of a static dataset into training, validation, and test sets. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | We implement our actors with Ray (Moritz et al., 2018) and the learner with EPyMARL (Papoudakis et al., 2021). Specific version numbers for these software dependencies are not provided. A hypothetical Ray actor sketch is given after the table. |
| Experiment Setup | Yes | We use the default training parameters from MARL-Algorithms. The learning rate for both the actor and critic networks is 5e-5. The clip parameter for PPO is 0.2. The discount factor γ is 0.99. The Generalized Advantage Estimation (GAE) λ is 0.95. The entropy coefficient is 0.01. The number of epochs for policy optimization is 15. The batch size is 256. The number of minibatches is 2. The observation and action spaces are discrete. We train agents for 200 million frames. These values are collected in the configuration sketch after the table. |
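
As a point of reference for the Pseudocode row, the following is a minimal sketch of a ranked policy memory and the surrounding training loop, assuming that policy checkpoints are bucketed by a discretised training-episode return and that the other agents' behavior policies are occasionally sampled from the memory. `RankedPolicyMemory`, `rank_resolution`, `sample_prob`, `rollout_fn`, and `learner` are this report's own placeholder names, not the authors' implementation of Algorithm 1.

```python
import random
from collections import defaultdict


class RankedPolicyMemory:
    """Sketch of a ranked policy memory: checkpoints are bucketed by a
    discretised training-episode return ("rank"). Hypothetical, not the
    authors' code."""

    def __init__(self, rank_resolution=10.0):
        self.rank_resolution = rank_resolution
        self.buckets = defaultdict(list)  # rank -> list of policy checkpoints

    def save(self, episode_return, policy_params):
        rank = int(episode_return // self.rank_resolution)
        self.buckets[rank].append(policy_params)

    def sample(self):
        # Pick a rank uniformly first, then a checkpoint within that rank,
        # so both low- and high-return behaviors are represented.
        rank = random.choice(list(self.buckets.keys()))
        return random.choice(self.buckets[rank])


def train_with_rpm(learner, rollout_fn, num_agents, num_episodes,
                   sample_prob=0.5):
    """Training-loop sketch: with probability `sample_prob`, the other agents
    act with policies drawn from the memory so the learner faces diverse
    partners/opponents. `learner` and `rollout_fn` are user-supplied stubs."""
    rpm = RankedPolicyMemory()
    for _ in range(num_episodes):
        if rpm.buckets and random.random() < sample_prob:
            others = [rpm.sample() for _ in range(num_agents - 1)]
        else:
            others = [learner.params] * (num_agents - 1)
        episode_return, trajectories = rollout_fn(learner.params, others)
        learner.update(trajectories)              # e.g. a PPO-style update
        rpm.save(episode_return, learner.params)  # grow the ranked memory
    return rpm
```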
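The Software Dependencies row notes that the actors are implemented with Ray and the learner with EPyMARL, without version numbers. Below is a rough, hypothetical illustration of the actor side of such an actor–learner setup using Ray's standard remote-actor API; `make_env`, `RolloutActor`, and the returned trajectory format are placeholders, not the paper's code.

```python
import ray


def make_env():
    """Stand-in for an environment builder (e.g. a Melting Pot substrate);
    hypothetical placeholder."""
    return None


@ray.remote
class RolloutActor:
    """Each remote actor owns one environment copy and collects episodes
    with parameters broadcast from a central learner."""

    def __init__(self, env_builder):
        self.env = env_builder()

    def rollout(self, policy_params):
        # Placeholder: a real actor would step the environment with the
        # given parameters and return a trajectory of transitions.
        return {"params": policy_params, "transitions": []}


if __name__ == "__main__":
    ray.init()
    actors = [RolloutActor.remote(make_env) for _ in range(4)]
    # Broadcast the current parameters and gather trajectories in parallel.
    trajectories = ray.get([a.rollout.remote("theta_t") for a in actors])
```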
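The Experiment Setup row lists the reported hyperparameters. The dataclass below simply collects those values in one place; the class and field names are illustrative, only the numbers come from the row above, and `actor_lr`/`critic_lr` both repeat the shared 5e-5 learning rate.

```python
from dataclasses import dataclass


@dataclass
class ReportedPPOConfig:
    """Hyperparameters quoted in the Experiment Setup row; field names are
    illustrative, not taken from the authors' code."""
    actor_lr: float = 5e-5           # learning rate, actor network
    critic_lr: float = 5e-5          # learning rate, critic network
    clip_param: float = 0.2          # PPO clipping parameter
    gamma: float = 0.99              # discount factor
    gae_lambda: float = 0.95         # GAE lambda
    entropy_coef: float = 0.01       # entropy bonus coefficient
    ppo_epochs: int = 15             # optimization epochs per update
    batch_size: int = 256
    num_minibatches: int = 2
    total_frames: int = 200_000_000  # training budget (200 million frames)
```

For instance, `cfg = ReportedPPOConfig()` would reproduce the quoted settings as defaults, with any individual value overridable by keyword argument.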