Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

Authors: Jiayu Chen, Zelai Xu, Yunfei Li, Chao Yu, Jiaming Song, Huazhong Yang, Fei Fang, Yu Wang, Yi Wu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in the particle-world environment and the Google Research Football environment show that SACL produces much stronger policies than the baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages while using only half the samples of MAPPO with self-play.
Researcher Affiliation | Collaboration | Tsinghua University, Luma AI, Carnegie Mellon University, Shanghai Qi Zhi Institute
Pseudocode | Yes | Algorithm 1: Subgame curriculum learning; Algorithm 2: Subgame Automatic Curriculum Learning (SACL)
Open Source Code | No | The paper states: "The project website is at https://sites.google.com/view/sacl-rl." This is a project website; the paper does not explicitly state that the source code for the methodology is provided there, and the link is not a direct link to a code repository.
Open Datasets | Yes | We evaluate SACL in three different zero-sum environments: the Multi-Agent Particle Environment (MPE) (Lowe et al. 2017), Google Research Football (GRF) (Kurach et al. 2020), and the hide-and-seek (HnS) environment (Baker et al. 2020).
Dataset Splits | No | The paper mentions training durations (e.g., "trained for 40M environment samples") but does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets) for the environments used.
Hardware Specification | No | The paper mentions "hundreds of GPUs" in the context of prior works requiring immense resources but does not provide any specific hardware details (e.g., GPU models or CPU types) for its own experiments.
Software Dependencies | No | The paper mentions using "MAPPO (Yu et al. 2021) as the backbone" but does not specify version numbers for MAPPO or for any other software dependencies, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | All algorithms are trained for 40M environment samples and the curves of approximate exploitability w.r.t. samples over three seeds are shown in Fig. 4(a) and 4(b); the first scenario is trained for 300M environment samples and the last two scenarios are trained for 400M samples; to satisfy the requirements in Proposition 1, the game is also reset according to the initial state distribution ρ(·) with 0.3 probability. A minimal sketch of this reset rule is shown after the table.
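To make the reset rule in the Experiment Setup row concrete, below is a minimal, hypothetical Python sketch of subgame curriculum sampling in the spirit of Algorithm 1. It is not the authors' code: CurriculumBuffer, choose_start_state, and the uniform buffer sampling are illustrative stand-ins for SACL's own weighting over buffered states, which is not reproduced here; only the 0.3 probability of resetting from the initial state distribution ρ(·) is taken from the paper's stated setup.

```python
"""Hypothetical sketch of subgame curriculum sampling (not the authors' code).

It only illustrates the reset rule quoted above: with probability 0.3 an
episode restarts from the initial state distribution rho(.), otherwise it
restarts from a state stored in a buffer of previously visited states.
Uniform buffer sampling is a placeholder for SACL's own state weighting.
"""
import random


class CurriculumBuffer:
    """Stores visited states and returns candidate subgame start states."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.states = []

    def add(self, state):
        if len(self.states) >= self.capacity:
            self.states.pop(0)             # drop the oldest stored state
        self.states.append(state)

    def sample(self):
        return random.choice(self.states)  # placeholder for weighted sampling


def choose_start_state(buffer, sample_initial_state, rho_reset_prob=0.3):
    """Pick the next episode's start state using the 0.3 reset probability."""
    if not buffer.states or random.random() < rho_reset_prob:
        return sample_initial_state()      # reset from rho(.)
    return buffer.sample()                 # reset to a stored subgame state


if __name__ == "__main__":
    buffer = CurriculumBuffer()
    sample_rho = lambda: 0.0               # toy initial-state distribution
    for episode in range(5):
        start = choose_start_state(buffer, sample_rho)
        # ... roll out the episode from `start` and update the policies ...
        buffer.add(start + random.random())  # toy "visited state"
```

In a full run, the rollout would add the states visited during each episode to the buffer and update both sides' policies with the MAPPO backbone, as outlined in Algorithm 2.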