RODE: Learning Roles to Decompose Multi-Agent Tasks

Authors: Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, Chongjie Zhang

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "our method (1) outperforms the current state-of-the-art MARL algorithms on 9 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents."
Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University (tonghanwang1996@gmail.com, chongjie@tsinghua.edu.cn); University of Oxford ({tarun.gupta, anuj.mahajan, bei.peng}@cs.ox.ac.uk, shimon.whiteson@cs.ox.ac.uk)
Pseudocode | No | The paper includes architectural diagrams (Figure 1) but no pseudocode or algorithm blocks.
Open Source Code | No | The abstract states: "Demonstrative videos can be viewed at https://sites.google.com/view/rode-marl.", which links to videos, not source code. The paper also mentions using "codes provided by their authors" for baselines, but does not state that RODE's code is open-source or available.
Open Datasets | Yes | "We choose the StarCraft II micromanagement (SMAC) benchmark (Samvelyan et al., 2019) as the testbed for its rich environments and high complexity of control." (A minimal environment-setup sketch is given after the table.)
Dataset Splits | No | The paper mentions running experiments with random seeds and showing median performance and percentiles, and using a replay buffer for sampling episodes. However, it does not specify a train/validation/test dataset split with percentages or sample counts for model evaluation.
Hardware Specification | Yes | "Experiments are carried out on NVIDIA GTX 2080 Ti GPU."
Software Dependencies | No | The paper mentions the use of the "RMSprop" optimizer and "QMIX"-style mixing networks, but does not provide specific version numbers for any software, libraries, or programming languages used (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | For all experiments, the optimization is conducted using RMSprop with a learning rate of 5 × 10^-4, α of 0.99, and with no momentum or weight decay. For exploration, we use ϵ-greedy with ϵ annealed linearly from 1.0 to 0.05 over 50K time steps and kept constant for the rest of training. For three hard-exploration maps (3s5z_vs_3s6z, 6h_vs_8z, and 27m_vs_30m), we extend the epsilon annealing time to 500K for both RODE and all the baselines and ablations. Batches of 32 episodes are sampled from the replay buffer, and the role selector and role policies are trained end-to-end on fully unrolled episodes. (A configuration sketch of these settings is given after the table.)
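As referenced in the Open Datasets row, the following is a minimal sketch (not taken from the paper) of how a SMAC map is typically instantiated with the open-source `smac` package; the map name is an arbitrary example, and running the snippet requires a local StarCraft II installation.

```python
# Minimal SMAC usage sketch (illustrative; not the authors' code).
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="corridor")  # example map from the SMAC benchmark
env_info = env.get_env_info()
n_agents = env_info["n_agents"]    # number of controllable units
n_actions = env_info["n_actions"]  # size of each agent's discrete action space

env.reset()
obs = env.get_obs()      # per-agent (partial) observations
state = env.get_state()  # global state used for centralized training
env.close()
```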
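The Experiment Setup row can likewise be summarized as configuration code. The sketch below assumes PyTorch and encodes only the hyperparameters quoted above (RMSprop with learning rate 5 × 10^-4, α = 0.99, no momentum or weight decay; linear ϵ annealing from 1.0 to 0.05 over 50K steps, extended to 500K on the hard-exploration maps; batches of 32 episodes); the placeholder parameter list and helper names are illustrative, not the authors' implementation.

```python
# Hyperparameter sketch matching the reported experiment setup (illustrative).
import torch

# Placeholder parameters; in RODE these would be the role selector,
# role policies, and mixing-network parameters trained end-to-end.
params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.RMSprop(
    params, lr=5e-4, alpha=0.99, momentum=0.0, weight_decay=0.0
)

def epsilon(t, anneal_steps=50_000, eps_start=1.0, eps_end=0.05):
    """Linear epsilon schedule: eps_start -> eps_end over anneal_steps,
    then held constant (anneal_steps=500_000 for the hard-exploration maps)."""
    frac = min(t / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

BATCH_SIZE_EPISODES = 32  # episodes sampled per update from the replay buffer
```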