Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning
Authors: Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, Yi Wu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also validate our two suggestions in more complex domains including the StarCraft Multi-Agent Challenge (SMAC) (Rashid et al., 2019) and Google Research Football (GRF) (Kurach et al., 2019). We compare the empirical performances of agent-specific policy learning, including PG-Ind and PG-ID, with shared policy learning (PG-sh) as well as popular VD algorithms, including QMIX and QPLEX, on the 2-player Bridge game. All the algorithms use the same batch size and are properly trained with sufficient samples. The final evaluation rewards are shown in Table 2. |
| Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University, China; (2) Department of Electronics Engineering, Tsinghua University, China; (3) Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA; (4) Shanghai Qi Zhi Institute, China. |
| Pseudocode | No | No explicit pseudocode block or algorithm section was found in the paper. |
| Open Source Code | No | Check our project website at https://sites.google.com/view/revisiting-marl. The website is a project overview page, not a direct link to a code repository. |
| Open Datasets | Yes | StarCraft Multi-Agent Challenge (SMAC) (Rashid et al., 2019) and Google Research Football (GRF) (Kurach et al., 2019). |
| Dataset Splits | No | The paper specifies training duration in environment frames (e.g., 'PG methods are trained for 50M environment frames') and number of random seeds for averaging, but does not provide explicit training/validation/test *dataset* splits or predefined splits in the traditional sense. |
| Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory amounts) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | The paper mentions software components like 'MAPPO project' and 'Adam optimizer', but does not provide specific version numbers for any key software libraries or dependencies (e.g., PyTorch, Python). |
| Experiment Setup | Yes | Hyperparameters of VD methods (except for CDS and RODE) and PG methods are shown in Table 6 and Table 7. For all the networks and embedding layers, we use 64 hidden units. The backbone of policy, value, and Q network is a 2-hidden-layer MLP for Bridge, with an additional GRU layer for SMAC and GRF. We use 4 attention heads for QPLEX and the attention-based backbone of the auto-regressive policy. We also add layer norm after each linear layer. Value normalization is applied to PG methods. The batch size is 3200 for PG methods in Bridge and SMAC, and 10000 in GRF. The number of PPO epochs is 5 in Bridge and 15 across all GRF scenarios. PG methods are trained for 50M environment frames on the counterattack-hard and corner scenarios, and 25M frames on other scenarios in GRF. |
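
The Experiment Setup row above describes the network backbone used for PG methods: a 2-hidden-layer MLP with 64 hidden units, layer norm after each linear layer, and an additional GRU layer for SMAC and GRF. The snippet below is a minimal sketch of such a backbone, assuming PyTorch; the class name `MLPGRUBackbone` and parameters such as `obs_dim` are illustrative and not taken from the authors' code.

```python
# Minimal sketch of the reported backbone (assumption: PyTorch).
# 2-hidden-layer MLP with 64 hidden units, LayerNorm after each linear layer,
# and an optional GRU layer as described for SMAC/GRF. Illustrative only.
import torch
import torch.nn as nn


class MLPGRUBackbone(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 64, use_gru: bool = False):
        super().__init__()
        self.use_gru = use_gru
        # Two hidden layers, each followed by LayerNorm, per the setup above.
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
        )
        if use_gru:
            # Recurrent layer used for SMAC and GRF in the reported configuration.
            self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, obs, h=None):
        # obs: (batch, obs_dim), or (batch, time, obs_dim) when use_gru=True.
        x = self.mlp(obs)
        if self.use_gru:
            x, h = self.gru(x, h)
        return x, h


if __name__ == "__main__":
    net = MLPGRUBackbone(obs_dim=32, use_gru=True)
    feats, hidden = net(torch.randn(4, 10, 32))  # batch of 4, sequence length 10
    print(feats.shape)  # torch.Size([4, 10, 64])
```

With `use_gru=False`, the same module corresponds to the MLP-only backbone described for Bridge; policy, value, and Q heads would sit on top of the 64-dimensional features it outputs.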