Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

Authors: Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, Yi Wu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also validate our two suggestions in more complex domains, including the StarCraft Multi-Agent Challenge (SMAC) (Rashid et al., 2019) and Google Research Football (GRF) (Kurach et al., 2019). We compare the empirical performance of agent-specific policy learning, including PG-Ind and PG-ID, with shared policy learning (PG-sh) as well as popular VD algorithms, including QMIX and QPLEX, on the 2-player Bridge game. All the algorithms use the same batch size and are properly trained with sufficient samples. The final evaluation rewards are shown in Table 2.
Researcher Affiliation | Academia | 1. Institute for Interdisciplinary Information Sciences, Tsinghua University, China; 2. Department of Electronic Engineering, Tsinghua University, China; 3. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA; 4. Shanghai Qi Zhi Institute, China.
Pseudocode | No | No explicit pseudocode block or algorithm section was found in the paper.
Open Source Code | No | The paper states "Check our project website at https://sites.google.com/view/revisiting-marl." The website is a project overview page, not a direct link to a code repository.
Open Datasets | Yes | StarCraft Multi-Agent Challenge (SMAC) (Rashid et al., 2019) and Google Research Football (GRF) (Kurach et al., 2019).
Dataset Splits | No | The paper specifies training duration in environment frames (e.g., "PG methods are trained for 50M environment frames") and the number of random seeds used for averaging, but does not provide explicit training/validation/test dataset splits or predefined splits in the traditional sense.
Hardware Specification | No | No specific hardware details (e.g., CPU or GPU models, memory amounts) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions software components such as the MAPPO project and the Adam optimizer, but does not provide version numbers for any key software libraries or dependencies (e.g., PyTorch, Python).
Experiment Setup | Yes | Hyperparameters of VD methods (except for CDS and RODE) and PG methods are shown in Table 6 and Table 7. For all the networks and embedding layers, we use 64 hidden units. The backbone of the policy, value, and Q networks is a 2-hidden-layer MLP for Bridge, with an additional GRU layer for SMAC and GRF. We use 4 attention heads for QPLEX and the attention-based backbone of the auto-regressive policy. We also add layer norm after each linear layer. Value normalization is applied to PG methods. The batch size is 3200 for PG methods in Bridge and SMAC, and 10000 in GRF. The PPO epoch is 5 in Bridge and 15 across all GRF scenarios. PG methods are trained for 50M environment frames on the counterattack-hard and corner scenarios, and 25M frames on the other GRF scenarios.
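
For concreteness, the network backbone described in the Experiment Setup row can be sketched as below. This is a minimal, illustrative PyTorch sketch under the stated hyperparameters (64 hidden units, a 2-hidden-layer MLP, layer norm after each linear layer, and an additional GRU layer for SMAC/GRF); the class name MARLBackbone, the use_rnn flag, and all tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumption: PyTorch) of the actor/critic backbone described above:
# a 2-hidden-layer MLP with 64 hidden units, LayerNorm after each linear layer, and an
# optional GRU layer used for the recurrent SMAC/GRF setting. Names such as
# MARLBackbone and use_rnn are illustrative, not the authors' code.
import torch
import torch.nn as nn


class MARLBackbone(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 64, use_rnn: bool = False):
        super().__init__()
        self.use_rnn = use_rnn
        # 2-hidden-layer MLP with layer norm after each linear layer.
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
        )
        if use_rnn:
            # Additional GRU layer for partially observable tasks (SMAC, GRF).
            self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
            self.rnn_norm = nn.LayerNorm(hidden_dim)

    def forward(self, obs, rnn_state=None):
        # obs: (batch, obs_dim), or (batch, time, obs_dim) when use_rnn is True.
        x = self.mlp(obs)
        if self.use_rnn:
            x, rnn_state = self.gru(x, rnn_state)
            x = self.rnn_norm(x)
        return x, rnn_state


# Usage sketch: a policy head on top of the backbone (dimensions are illustrative).
if __name__ == "__main__":
    backbone = MARLBackbone(obs_dim=48, use_rnn=True)
    actor_head = nn.Linear(64, 10)   # 10 discrete actions, purely illustrative
    obs = torch.randn(8, 5, 48)      # (batch, time, obs_dim)
    feats, _ = backbone(obs)
    logits = actor_head(feats)
    print(logits.shape)              # torch.Size([8, 5, 10])
```

Under the paper's comparison, a shared-policy setup (PG-sh) would reuse one such backbone across all agents, whereas agent-specific learning (PG-Ind/PG-ID) would instantiate a separate one per agent.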