Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Authors: Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art.
Researcher Affiliation | Collaboration | ¹University of Oxford, ²Huawei R&D UK, ³ShanghaiTech University, ⁴Shanghai Jiao Tong University, ⁵University College London, ⁶Institute for AI, Peking University & BIGAI
Pseudocode | Yes | The detailed pseudocode of HATRPO is listed in Appendix D.3. ... We refer to the above procedure as HAPPO and Appendix D.4 for its full pseudocode. (An illustrative sketch of the sequential HAPPO update appears after this table.)
Open Source Code | Yes | Code is available at https://github.com/PKU-MARL/TRPO-PPO-in-MARL.
Open Datasets | Yes | We consider two most common benchmarks StarCraft II Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) and Multi-Agent MuJoCo (de Witt et al., 2020b) for evaluating MARL algorithms. (A minimal SMAC environment sketch appears after this table.)
Dataset Splits | No | The paper uses standard benchmark environments but does not specify explicit training/validation/test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The implementation of MADDPG is adopted from the Tianshou framework (Weng et al., 2021). The paper mentions software components such as the Adam optimizer and an MLP actor network, but does not provide specific version numbers for any libraries or frameworks (e.g., PyTorch, TensorFlow, Python, or Tianshou).
Experiment Setup | Yes | All hyperparameter settings and implementation details can be found in Appendix E. Tables 1-7 in Appendix E provide detailed hyperparameter values for the SMAC and Multi-Agent MuJoCo domains, including learning rates, batch sizes, optimizer settings, and other configuration parameters.
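
To make the Pseudocode row concrete, below is a minimal sketch of the sequential HAPPO update the paper describes: clipped PPO-style objectives applied agent by agent in a random order, with each agent's advantage reweighted by the probability ratios of the agents updated before it. The `agents`, `buffer`, `agent.actor.dist`, and `agent.optimizer` interfaces are assumptions for illustration, not the authors' released implementation; Appendix D.4 of the paper gives the authoritative pseudocode.

```python
import torch

def happo_update(agents, buffer, clip_eps=0.2, ppo_epochs=5):
    """One HAPPO iteration: sequential clipped updates in a random agent order.

    Assumed interfaces (illustrative only): `buffer.joint_advantage()` returns a
    detached joint advantage estimate A(s, a) per sample; `buffer.agent_batch(i)`
    returns that agent's observations, actions, and old log-probabilities; each
    agent exposes `.actor.dist(obs)` (a torch distribution) and `.optimizer`.
    """
    # Joint advantage from a centralised critic; M accumulates the probability
    # ratios of agents that have already been updated in this iteration.
    m = buffer.joint_advantage().clone()

    for i in torch.randperm(len(agents)).tolist():   # random permutation of agents
        agent = agents[i]
        obs, act, old_logp = buffer.agent_batch(i)

        for _ in range(ppo_epochs):
            logp = agent.actor.dist(obs).log_prob(act)
            ratio = torch.exp(logp - old_logp)
            # Clipped surrogate objective, with M in place of the raw advantage.
            surr1 = ratio * m
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * m
            loss = -torch.min(surr1, surr2).mean()

            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()

        # Fold this agent's final ratio into M for the agents updated next.
        with torch.no_grad():
            new_logp = agent.actor.dist(obs).log_prob(act)
            m = m * torch.exp(new_logp - old_logp)
```

The running factor M is what distinguishes this scheme from independent PPO: each agent optimises against the joint advantage reweighted by its predecessors' updates, which is the mechanism the paper's monotonic-improvement analysis builds on.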
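
For the Open Datasets row, note that the benchmarks are interactive environments rather than static datasets, which is also why no train/validation/test splits exist. As a rough illustration of how experience is generated in SMAC using the public `smac` package (the map name and the random policy are placeholders, unrelated to the paper's configurations):

```python
import numpy as np
from smac.env import StarCraft2Env

# Roll out one episode on a SMAC map with uniformly random valid actions.
env = StarCraft2Env(map_name="3m")        # placeholder map name
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
episode_return = 0.0
while not terminated:
    obs = env.get_obs()                   # per-agent observations for the actors
    state = env.get_state()               # global state for a centralised critic
    actions = []
    for agent_id in range(n_agents):
        avail = env.get_avail_agent_actions(agent_id)
        actions.append(np.random.choice(np.nonzero(avail)[0]))
    reward, terminated, info = env.step(actions)
    episode_return += reward
env.close()
print("episode return:", episode_return)
```

Multi-Agent MuJoCo exposes a similar multi-agent, Gym-style interface in which each agent controls a subset of the robot's joints.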