Divergence-Regularized Multi-Agent Actor-Critic
Authors: Kefan Su, Zongqing Lu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate DMAC in a didactic stochastic game and StarCraft Multi-Agent Challenge and show that DMAC substantially improves the performance of existing MARL algorithms. |
| Researcher Affiliation | Academia | 1School of Computer Science, Peking University. Correspondence to: Zongqing Lu <zongqing.lu@pku.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 gives the training procedure of DMAC. |
| Open Source Code | No | Our code is based on the implementation of PyMARL (Samvelyan et al., 2019), MAAC (Iqbal & Sha, 2019), DOP (Wang et al., 2021b), FOP (Zhang et al., 2021) and an open source code for algorithms in SMAC (https://github.com/starry-sky6688/StarCraft). The paper states their code is *based on* existing open-source implementations, but does not explicitly provide their own code or a link to it. |
| Open Datasets | Yes | We test all the methods in five tasks of SMAC (Samvelyan et al., 2019). (Samvelyan et al., 2019) is cited as 'The StarCraft Multi-Agent Challenge'. |
| Dataset Splits | No | The paper describes training and evaluation procedures within simulated environments (a stochastic game and SMAC) but does not provide explicit train/validation/test splits, as percentages or fixed counts over a static dataset, of the kind typically reported for reproducible validation. |
| Hardware Specification | Yes | We do all the experiments by a server with 2 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'PyMARL', 'MAAC', 'DOP', 'FOP', 'GRUCell', 'ReLU', and 'RMSprop optimizer' but does not provide specific version numbers for these software dependencies or the programming language used. |
| Experiment Setup | Yes | All the policy networks are the same as two linear layers and one GRUCell layer with ReLU activation and the number of hidden units is 64. The individual Q-networks for the QMIX group are the same as the policy network mentioned before. The critic network for the COMA group is an MLP with three 128-unit hidden layers and ReLU activation. The attention dimension in the critic networks of the MAAC group is 32. The number of hidden units of the mixer network in the QMIX group is 32. The learning rate for the critic is 10^-3 and the learning rate for the actor is 10^-4. We train all networks with the RMSprop optimizer. The discounted factor is γ = 0.99. The coefficient of the regularizer is ω = 0.01 for SMAC tasks and ω = 0.2 for the stochastic game. The td_lambda factor used in the COMA group is 0.8. The parameter used for soft updating the target policy is τ = 0.01. (A hedged code sketch of this configuration follows the table.) |
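
The snippet below is a minimal sketch, not the authors' released code, of the network and optimizer configuration described in the Experiment Setup row: a two-linear-layer policy with one GRUCell and ReLU activation (64 hidden units), a COMA-style MLP critic with three 128-unit hidden layers, RMSprop with the stated learning rates, and the quoted scalar hyperparameters. Observation and action dimensions (`obs_dim`, `n_actions`) are hypothetical placeholders, since the paper does not fix them per task here.

```python
# Hedged sketch of the reported setup; assumes PyTorch (PyMARL-based code is PyTorch).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Two linear layers and one GRUCell with ReLU, 64 hidden units (as described)."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)      # first linear layer
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent GRU cell
        self.fc2 = nn.Linear(hidden_dim, n_actions)     # second linear layer

    def forward(self, obs, hidden):
        x = torch.relu(self.fc1(obs))
        hidden = self.rnn(x, hidden)
        return self.fc2(hidden), hidden

# Hypothetical dimensions for illustration only.
obs_dim, n_actions = 80, 11

actor = PolicyNet(obs_dim, n_actions)

# COMA-group critic: MLP with three 128-unit hidden layers and ReLU activation.
critic = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, n_actions),
)

# RMSprop with the stated learning rates: 1e-4 for the actor, 1e-3 for the critic.
actor_opt = torch.optim.RMSprop(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.RMSprop(critic.parameters(), lr=1e-3)

# Scalar hyperparameters quoted in the paper.
gamma = 0.99   # discount factor
omega = 0.01   # divergence-regularizer coefficient (SMAC); 0.2 for the stochastic game
tau = 0.01     # soft-update rate for the target policy
td_lambda = 0.8  # td_lambda factor for the COMA group
```

This only mirrors the hyperparameters listed in the table; the DMAC training loop itself (Algorithm 1) is not reproduced here.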