DOP: Off-Policy Multi-Agent Decomposed Policy Gradients

Authors: Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Pseudocode | Yes | In this section, we describe the details of our algorithms, as shown in Algorithms 1 and 2. Algorithm 1: Stochastic DOP; Algorithm 2: Deterministic DOP.
Open Source Code | No | The paper mentions "Demonstrative videos are available at https://sites.google.com/view/dop-mapg/" but does not provide a link to the source code for the methodology.
Open Datasets | Yes | We evaluate our methods on both the StarCraft II micromanagement benchmark (Samvelyan et al., 2019) (discrete action spaces) and multi-agent particle environments (Lowe et al., 2017; Mordatch & Abbeel, 2018) (continuous action spaces).
Dataset Splits | No | The paper evaluates on standard benchmarks such as StarCraft II micromanagement and multi-agent particle environments but does not explicitly provide training/validation/test dataset splits (e.g., percentages or counts).
Hardware Specification | Yes | Experiments are carried out on NVIDIA P100 GPUs and with fixed hyper-parameter settings, which are described in the following sections.
Software Dependencies | No | The paper mentions optimizers (RMSprop) and network components (GRU) but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or Python.
Experiment Setup | Yes | For all experiments, we set κ = 0.5 and use an off-policy replay buffer storing the latest 5000 episodes and an on-policy buffer with a size of 32. We run 4 parallel environments to collect data. The optimization of both the critic and actors is conducted using RMSprop with a learning rate of 5 × 10⁻⁴, α = 0.99, and no momentum or weight decay. For exploration, we use ϵ-greedy with ϵ annealed linearly from 1.0 to 0.05 over 500k time steps and kept constant for the rest of training. Mixed batches consisting of 32 episodes sampled from the off-policy replay buffer and 16 episodes sampled from the on-policy buffer are used to train the critic. For training actors, we sample 16 episodes from the on-policy buffer each time. The framework is trained on fully unrolled episodes. The learning rates for the critic and actors are set to 1 × 10⁻⁴ and 5 × 10⁻⁴, respectively. We use a 5-step decomposed multi-agent tree backup.
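
The quoted experiment setup is concrete enough to be collected into a single configuration. The sketch below is a minimal, hypothetical reconstruction in Python: the dictionary keys, the placeholder networks, and the assumption of a PyTorch implementation are illustrative choices of this report, not artifacts from the authors (the paper does not name its framework). Only the numerical values are taken from the quoted setup.

```python
# Hypothetical DOP training configuration assembled from the quoted experiment
# setup. Key names are illustrative; values come from the paper's description.
dop_config = {
    # Data collection and buffers
    "parallel_envs": 4,                  # parallel environments collecting data
    "off_policy_buffer_episodes": 5000,  # replay buffer keeps the latest 5000 episodes
    "on_policy_buffer_episodes": 32,     # small on-policy buffer

    # Mixed-batch training
    "critic_batch_off_policy": 32,       # off-policy episodes per critic update
    "critic_batch_on_policy": 16,        # on-policy episodes per critic update
    "actor_batch_on_policy": 16,         # on-policy episodes per actor update

    # Optimization (RMSprop, no momentum or weight decay)
    "critic_lr": 1e-4,
    "actor_lr": 5e-4,
    "rmsprop_alpha": 0.99,

    # Exploration (epsilon-greedy, linear anneal then constant)
    "epsilon_start": 1.0,
    "epsilon_finish": 0.05,
    "epsilon_anneal_steps": 500_000,

    # DOP-specific settings
    "kappa": 0.5,                        # kappa value reported in the setup
    "tree_backup_steps": 5,              # 5-step decomposed multi-agent tree backup
}


def epsilon_at(step: int) -> float:
    """Linear epsilon schedule: 1.0 -> 0.05 over 500k steps, then constant."""
    frac = min(step / dop_config["epsilon_anneal_steps"], 1.0)
    return dop_config["epsilon_start"] + frac * (
        dop_config["epsilon_finish"] - dop_config["epsilon_start"]
    )


# Optimizers as described, assuming a PyTorch implementation (not confirmed by
# the paper); `critic` and `actor` are placeholder modules standing in for the
# actual networks.
import torch

critic = torch.nn.Linear(8, 1)
actor = torch.nn.Linear(8, 4)

critic_opt = torch.optim.RMSprop(
    critic.parameters(),
    lr=dop_config["critic_lr"],
    alpha=dop_config["rmsprop_alpha"],
    momentum=0.0,
    weight_decay=0.0,
)
actor_opt = torch.optim.RMSprop(
    actor.parameters(),
    lr=dop_config["actor_lr"],
    alpha=dop_config["rmsprop_alpha"],
    momentum=0.0,
    weight_decay=0.0,
)
```

Under this reading, each critic update would draw 32 episodes from the off-policy replay buffer and 16 from the on-policy buffer, while actor updates use only the 16 on-policy episodes, matching the mixed-batch scheme quoted above.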