Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing

Authors: Filippos Christianos, Georgios Papoudakis, Muhammad A Rahman, Stefano V Albrecht

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate both whether SePS performs as intended by correctly partitioning the agents and whether this partitioning helps in improving the overall returns, sample complexity, and training time. For RL, we use the A2C (Mnih et al., 2016) algorithm and report the sum of returns of all agents. Table 2: Maximum evaluation returns with std across seeds. Figure 5: Learning curves showing the mean returns during training for a selection of the environments.
Researcher Affiliation | Academia | School of Informatics, University of Edinburgh, Edinburgh, United Kingdom. Correspondence to: Filippos Christianos <f.christianos@ed.ac.uk>.
Pseudocode | No | The paper describes methods and algorithms but does not include any structured pseudocode or algorithm blocks (e.g., a labeled Algorithm 1).
Open Source Code | Yes | We provide an open-source implementation of SePS here: https://github.com/uoe-agents/seps
Open Datasets | Yes | We use four multi-agent environments (Fig. 3) which are described below and summarised in Table 1. Blind-particle Spread: Our motivating toy environment is a custom scenario created with the Multi-agent Particle Environment (MPE) (Lowe et al., 2017). The Coloured Multi-Robot Warehouse (C-RWARE, Fig. 3b) is a variation of the RWARE environment (Christianos et al., 2020). Level-based Foraging (LBF, Fig. 3c) (Albrecht & Ramamoorthy, 2013) is a multi-agent environment... While the multi-agent StarCraft (SMAC) (Samvelyan et al., 2019) environment...
Dataset Splits | No | The paper uses reinforcement learning environments in which agents learn directly from interaction; it does not provide training/validation/test dataset splits with percentages or sample counts for traditional datasets.
Hardware Specification | Yes | Figure 8 was generated in an AMD Epyc 7702 running Python 3 with environments sampled in parallel threads.
Software Dependencies | No | The paper mentions 'Python 3' but does not specify version numbers for other key software components, libraries, or frameworks used for the experiments (e.g., PyTorch, TensorFlow, specific RL libraries).
Experiment Setup | Yes | In our experiments, we used Adam with a learning rate of 3e-4, optimiser epsilon 1e-5, entropy coefficient 1e-2, and value, critic, and encoder-decoder networks with two layers of 64 or 128 units. Eight environments were sampled concurrently and 5-step returns were computed. ... For the encoder-decoder training, m was set at 5, the KL loss was scaled by 1e-4, and we used batch size 128.
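
For reference, the hyperparameters quoted in the Experiment Setup row translate into roughly the following sketch. This is not the authors' code (their implementation is in the repository linked above); it is a minimal illustration assuming a standard PyTorch setup, and the network class, input/output dimensions, and variable names are placeholders.

```python
# Hedged sketch of the reported A2C / encoder-decoder hyperparameters.
# Values are taken from the quoted Experiment Setup; everything else is illustrative.
import torch
import torch.nn as nn

config = {
    "lr": 3e-4,            # Adam learning rate
    "adam_eps": 1e-5,      # Adam optimiser epsilon
    "entropy_coef": 1e-2,  # entropy bonus coefficient
    "n_steps": 5,          # n-step returns
    "n_envs": 8,           # environments sampled concurrently
    "hidden_dim": 64,      # two hidden layers of 64 (or 128) units per network
    "kl_coef": 1e-4,       # scaling of the KL term in the encoder-decoder loss
    "m": 5,                # m used for encoder-decoder training
    "batch_size": 128,     # encoder-decoder batch size
}

class TwoLayerMLP(nn.Module):
    """Two-hidden-layer MLP standing in for the value, critic, and
    encoder/decoder networks described in the paper."""
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Placeholder observation/action dimensions; the real values depend on the environment.
policy = TwoLayerMLP(in_dim=30, out_dim=5, hidden_dim=config["hidden_dim"])
optimizer = torch.optim.Adam(policy.parameters(), lr=config["lr"], eps=config["adam_eps"])
```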