Divergence-Regularized Multi-Agent Actor-Critic

Authors: Kefan Su, Zongqing Lu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate DMAC in a didactic stochastic game and StarCraft Multi-Agent Challenge and show that DMAC substantially improves the performance of existing MARL algorithms.
Researcher Affiliation | Academia | School of Computer Science, Peking University. Correspondence to: Zongqing Lu <zongqing.lu@pku.edu.cn>.
Pseudocode | Yes | Algorithm 1 gives the training procedure of DMAC.
Open Source Code | No | Our code is based on the implementation of PyMARL (Samvelyan et al., 2019), MAAC (Iqbal & Sha, 2019), DOP (Wang et al., 2021b), FOP (Zhang et al., 2021) and an open source code for algorithms in SMAC (https://github.com/starry-sky6688/StarCraft). The paper states their code is *based on* existing open-source implementations, but does not explicitly provide their own code or a link to it.
Open Datasets | Yes | We test all the methods in five tasks of SMAC (Samvelyan et al., 2019). (Samvelyan et al., 2019) is cited as 'The StarCraft Multi-Agent Challenge'.
Dataset Splits | No | The paper describes training and evaluation procedures within simulated environments (a stochastic game and SMAC) but does not provide explicit train/validation/test dataset splits with percentages or fixed counts for a static dataset, as would be expected for reproducible validation.
Hardware Specification | Yes | We do all the experiments by a server with 2 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'PyMARL', 'MAAC', 'DOP', 'FOP', 'GRUCell', 'ReLU', and 'RMSprop optimizer' but does not provide specific version numbers for these software dependencies or the programming language used.
Experiment Setup | Yes | All the policy networks are the same: two linear layers and one GRUCell layer with ReLU activation, and the number of hidden units is 64. The individual Q-networks for the QMIX group are the same as the policy network mentioned before. The critic network for the COMA group is an MLP with three 128-unit hidden layers and ReLU activation. The attention dimension in the critic networks of the MAAC group is 32. The number of hidden units of the mixer network in the QMIX group is 32. The learning rate for the critic is 10^-3 and the learning rate for the actor is 10^-4. We train all networks with the RMSprop optimizer. The discount factor is γ = 0.99. The coefficient of the regularizer is ω = 0.01 for SMAC tasks and ω = 0.2 for the stochastic game. The td_lambda factor used in the COMA group is 0.8. The parameter used for soft updating the target policy is τ = 0.01.
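
For concreteness, the reported network architecture and optimizer settings can be sketched in PyTorch (the paper builds on PyMARL, which is PyTorch-based). This is a minimal sketch under stated assumptions, not the authors' implementation: the class name RecurrentPolicy, the observation/action dimensions, and the optimizer wiring are hypothetical; only the hyperparameters themselves (64 hidden units, ReLU, GRUCell, RMSprop, critic lr 10^-3, actor lr 10^-4, γ, ω, td_lambda, τ) are taken from the quoted setup.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of the reported policy network: two linear layers and one
    GRUCell with ReLU activation and 64 hidden units (dims are placeholders)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        x = torch.relu(self.fc1(obs))
        h = self.rnn(x, h)       # recurrent hidden state carried across timesteps
        logits = self.fc2(h)     # per-action logits; the caller applies softmax
        return logits, h

# Hyperparameters as reported in the paper (values quoted above).
GAMMA = 0.99        # discount factor
OMEGA_SMAC = 0.01   # regularizer coefficient for SMAC tasks
OMEGA_SG = 0.2      # regularizer coefficient for the stochastic game
TD_LAMBDA = 0.8     # td_lambda used by the COMA group
TAU = 0.01          # soft-update rate for the target policy

# Hypothetical dimensions, for illustration only.
actor = RecurrentPolicy(obs_dim=32, n_actions=10)
actor_opt = torch.optim.RMSprop(actor.parameters(), lr=1e-4)  # actor lr 10^-4
# A critic network (not sketched here) would use RMSprop with lr=1e-3.
```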