DOP: Off-Policy Multi-Agent Decomposed Policy Gradients
Authors: Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. |
| Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China |
| Pseudocode | Yes | In this section, we describe the details of our algorithms, as shown in Algorithms 1 and 2 (Algorithm 1: Stochastic DOP; Algorithm 2: Deterministic DOP). |
| Open Source Code | No | The paper mentions "Demonstrative videos are available at https://sites.google.com/view/dop-mapg/" but does not provide a link to the source code for the methodology. |
| Open Datasets | Yes | We evaluate our methods on both the StarCraft II micromanagement benchmark (Samvelyan et al., 2019) (discrete action spaces) and multi-agent particle environments (Lowe et al., 2017; Mordatch & Abbeel, 2018) (continuous action spaces). |
| Dataset Splits | No | The paper evaluates on standard benchmarks like StarCraft II micromanagement and multi-agent particle environments but does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or counts) within the text. |
| Hardware Specification | Yes | Experiments are carried out on NVIDIA P100 GPUs and with fixed hyper-parameter settings, which are described in the following sections. |
| Software Dependencies | No | The paper mentions optimizers (RMSprop) and network components (GRU) but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or Python. |
| Experiment Setup | Yes | For all experiments, we set κ = 0.5 and use an off-policy replay buffer storing the latest 5000 episodes and an on-policy buffer with a size of 32. We run 4 parallel environments to collect data. The optimization of both the critic and actors is conducted using RMSprop with a learning rate of 5 × 10⁻⁴, α of 0.99, and with no momentum or weight decay. For exploration, we use ϵ-greedy with ϵ annealed linearly from 1.0 to 0.05 over 500k time steps and kept constant for the rest of the training. Mixed batches consisting of 32 episodes sampled from the off-policy replay buffer and 16 episodes sampled from the on-policy buffer are used to train the critic. For training actors, we sample 16 episodes from the on-policy buffer each time. The framework is trained on fully unrolled episodes. The learning rates for the critic and actors are set to 1 × 10⁻⁴ and 5 × 10⁻⁴, respectively. We use 5-step decomposed multi-agent tree backup. |
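
For convenience, the hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The Python snippet below is purely illustrative: the dictionary keys, the `DOP_CONFIG` name, and the `epsilon` helper are our own labels rather than identifiers from the authors' code, and where the excerpt reports both a shared learning rate of 5 × 10⁻⁴ and separate rates of 1 × 10⁻⁴ (critic) / 5 × 10⁻⁴ (actors), the sketch assumes the latter per-network values.

```python
# Minimal sketch of the reported DOP training configuration.
# All key names are illustrative; the values come from the excerpt above.
DOP_CONFIG = {
    "kappa": 0.5,                            # coefficient κ
    "off_policy_buffer_episodes": 5000,      # replay buffer keeps the latest 5000 episodes
    "on_policy_buffer_episodes": 32,
    "parallel_envs": 4,
    "optimizer": "RMSprop",                  # α = 0.99, no momentum, no weight decay
    "rmsprop_alpha": 0.99,
    "critic_lr": 1e-4,
    "actor_lr": 5e-4,
    "critic_batch": {"off_policy_episodes": 32, "on_policy_episodes": 16},
    "actor_batch_on_policy_episodes": 16,
    "tree_backup_steps": 5,                  # 5-step decomposed multi-agent tree backup
    "epsilon_start": 1.0,
    "epsilon_end": 0.05,
    "epsilon_anneal_steps": 500_000,
}


def epsilon(step: int, cfg: dict = DOP_CONFIG) -> float:
    """Linearly anneal ϵ from 1.0 to 0.05 over 500k time steps, then hold it constant."""
    frac = min(step / cfg["epsilon_anneal_steps"], 1.0)
    return cfg["epsilon_start"] + frac * (cfg["epsilon_end"] - cfg["epsilon_start"])
```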