Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
DOP: Off-Policy Multi-Agent Decomposed Policy Gradients
Authors: Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. |
| Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China |
| Pseudocode | Yes | In this section, we describe the details of our algorithms, as shown in Algorithm 1 and 2. Algorithm 1 Stochastic DOP, Algorithm 2 Deterministic DOP |
| Open Source Code | No | The paper mentions "Demonstrative videos are available at https://sites.google.com/view/dop-mapg/" but does not provide a link to the source code for the methodology. |
| Open Datasets | Yes | We evaluate our methods on both the StarCraft II micromanagement benchmark (Samvelyan et al., 2019) (discrete action spaces) and multi-agent particle environments (Lowe et al., 2017; Mordatch & Abbeel, 2018) (continuous action spaces). |
| Dataset Splits | No | The paper evaluates on standard benchmarks like StarCraft II micromanagement and multi-agent particle environments but does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or counts) within the text. |
| Hardware Specification | Yes | Experiments are carried out on NVIDIA P100 GPUs and with fixed hyper-parameter settings, which are described in the following sections. |
| Software Dependencies | No | The paper mentions optimizers (RMSprop) and network components (GRU) but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or Python. |
| Experiment Setup | Yes | For all experiments, we set κ = 0.5 and use an off-policy replay buffer storing the latest 5000 episodes and an on-policy buffer with a size of 32. We run 4 parallel environments to collect data. The optimization of both the critic and actors is conducted using RMSprop with a learning rate of 5 × 10⁻⁴, α of 0.99, and with no momentum or weight decay. For exploration, we use ϵ-greedy with ϵ annealed linearly from 1.0 to 0.05 over 500k time steps and kept constant for the rest of the training. Mixed batches consisting of 32 episodes sampled from the off-policy replay buffer and 16 episodes sampled from the on-policy buffer are used to train the critic. For training actors, we sample 16 episodes from the on-policy buffer each time. The framework is trained on fully unrolled episodes. The learning rates for the critic and actors are set to 1 × 10⁻⁴ and 5 × 10⁻⁴, respectively. And we use 5-step decomposed multi-agent tree backup. |
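
The exploration schedule and buffer sizes quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code (the paper links no source): the names `epsilon_at` and `sample_mixed_batch` are hypothetical, episodes are stood in for by plain integers, and uniform sampling without replacement is assumed because the quoted text does not say how episodes are drawn.

```python
import random
from collections import deque

# Buffer capacities quoted in the setup: latest 5000 episodes off-policy,
# 32 episodes on-policy. deque(maxlen=...) drops the oldest entry when full.
OFF_POLICY_CAPACITY = 5000
ON_POLICY_CAPACITY = 32


def epsilon_at(step, start=1.0, end=0.05, anneal_steps=500_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it constant."""
    frac = min(max(step, 0) / anneal_steps, 1.0)
    return start + frac * (end - start)


def sample_mixed_batch(off_buffer, on_buffer, n_off=32, n_on=16, rng=random):
    """Build a critic batch: n_off off-policy episodes plus n_on on-policy ones.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    off = rng.sample(list(off_buffer), min(n_off, len(off_buffer)))
    on = rng.sample(list(on_buffer), min(n_on, len(on_buffer)))
    return off + on


# Hypothetical usage: integers stand in for stored episodes.
off_buffer = deque(maxlen=OFF_POLICY_CAPACITY)
on_buffer = deque(maxlen=ON_POLICY_CAPACITY)
for episode_id in range(6000):
    off_buffer.append(episode_id)
    on_buffer.append(episode_id)

critic_batch = sample_mixed_batch(off_buffer, on_buffer)
```

With these defaults a critic batch holds 48 episodes, and `epsilon_at` stays at 0.05 for every step past 500k, matching the "kept constant for the rest of the training" wording.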