Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning
Authors: Qian Long*, Zihan Zhou*, Abhinav Gupta, Fei Fang, Yi Wu†, Xiaolong Wang†
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on three challenging environments, including a predator-prey-style Grassland game, a mixed cooperative-and-competitive Adversarial Battle game and a fully cooperative Food Collection game. We compare EPC with multiple baseline methods on these environments with different scales of agent populations and show consistently large gains over the baselines. |
| Researcher Affiliation | Collaboration | Qian Long, CMU, qianlong@cs.cmu.edu; Zihan Zhou, SJTU, footoredo@sjtu.edu.cn; Abhinav Gupta, CMU & Facebook AI Research, abhinavg@cs.cmu.edu; Fei Fang, CMU, feif@cs.cmu.edu; Yi Wu, OpenAI, jxwuyi@openai.com; Xiaolong Wang, UCSD, xiw012@ucsd.edu |
| Pseudocode | Yes | Algorithm 1: Evolutionary Population Curriculum (an illustrative sketch of such a loop follows the table) |
| Open Source Code | Yes | The source code and videos can be found at https://sites.google.com/view/epciclr2020/. |
| Open Datasets | Yes | All these environments are built on top of the particle-world environment (Mordatch & Abbeel, 2018) where agents take actions in discrete timesteps in a continuous 2D world. [...] Food Collection: This is exactly the same game as the Cooperative Navigation game in the MADDPG paper. (An illustrative scenario-loading sketch follows the table.) |
| Dataset Splits | No | The paper describes training in a multi-agent reinforcement learning environment through episodes and stages with a progressively growing agent population; it does not provide fixed training/validation/test splits of the kind defined for static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and following hyperparameters from a previous work, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | We follow all the hyper-parameters in the original MADDPG paper (Lowe et al., 2017) for both EPC and all the baseline methods considered. Particularly, we use the Adam optimizer with learning rate 0.01, β1 = 0.9, β2 = 0.999 and ε = 10^-8 across all experiments. τ = 0.01 is set for target network update and γ = 0.95 is used as discount factor. We also use a replay buffer of size 10^6 and we update the network parameters after every 100 samples. The batch size is 1024. (A minimal configuration sketch follows the table.) |
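
The Pseudocode row points to the paper's Algorithm 1 (Evolutionary Population Curriculum). Purely as orientation, below is a minimal sketch of a population-curriculum loop of the kind the paper describes: parallel agent sets, population scaling via mix-and-match crossover, fine-tuning at the new scale, and fitness-based selection. The function names (`fine_tune_maddpg`, `evaluate_fitness`), the list-based representation of an agent set, and the concatenation-style crossover are illustrative assumptions, not the authors' implementation.

```python
import itertools


def evolutionary_population_curriculum(init_sets, num_stages, k_keep,
                                        fine_tune_maddpg, evaluate_fitness):
    """Hedged sketch of a population-curriculum training loop.

    init_sets:   K independently initialized agent sets for the smallest
                 game scale (here an "agent set" is simply a list of policies).
    num_stages:  how many times the agent population is scaled up.
    k_keep:      number of parallel agent sets kept after each selection step.
    fine_tune_maddpg(agent_set, scale) and evaluate_fitness(agent_set) are
    hypothetical callables standing in for MADDPG fine-tuning and reward
    evaluation at a given game scale.
    """
    # Stage 0: train the initial parallel sets at the smallest scale.
    population = [fine_tune_maddpg(s, scale=0) for s in init_sets]

    for stage in range(1, num_stages + 1):
        # Mix-and-match (crossover): combine pairs of surviving sets into
        # larger candidate sets for the scaled-up game.
        candidates = [a + b for a, b in itertools.combinations(population, 2)]

        # Fine-tune every candidate set at the new, larger scale.
        candidates = [fine_tune_maddpg(c, scale=stage) for c in candidates]

        # Selection: keep only the top-k candidate sets by fitness.
        candidates.sort(key=evaluate_fitness, reverse=True)
        population = candidates[:k_keep]

    # Return the best surviving agent set at the final scale.
    return max(population, key=evaluate_fitness)
```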
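
The Open Datasets row notes that all three games are built on the particle-world codebase (Mordatch & Abbeel, 2018). Assuming the standard `multiagent-particle-envs` API, a scenario can be instantiated roughly as below; `simple_spread` is the stock Cooperative Navigation scenario, which the paper states is the same game as Food Collection. Whether the released EPC code wraps its Grassland and Adversarial Battle scenarios the same way is an assumption.

```python
# Sketch: loading a particle-world scenario via multiagent-particle-envs
# (https://github.com/openai/multiagent-particle-envs).
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios

# "simple_spread" is the stock Cooperative Navigation scenario; the paper's
# Grassland and Adversarial Battle games would be custom scenarios.
scenario = scenarios.load("simple_spread.py").Scenario()
world = scenario.make_world()
env = MultiAgentEnv(world,
                    reset_callback=scenario.reset_world,
                    reward_callback=scenario.reward,
                    observation_callback=scenario.observation)

obs_n = env.reset()  # one observation per agent
print(len(env.agents), [o.shape for o in obs_n])
```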
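
The Experiment Setup row lists concrete hyper-parameters. As an illustration only of how those numbers map onto an implementation (the original MADDPG code is TensorFlow-based; PyTorch is used here purely for concreteness), they could be collected and applied as follows:

```python
import torch

# Hyper-parameters reported in the paper (following the MADDPG defaults).
HPARAMS = {
    "lr": 1e-2,             # Adam learning rate
    "betas": (0.9, 0.999),  # Adam β1, β2
    "eps": 1e-8,            # Adam ε
    "tau": 0.01,            # soft target-network update coefficient
    "gamma": 0.95,          # discount factor
    "buffer_size": int(1e6),
    "update_every": 100,    # environment samples between parameter updates
    "batch_size": 1024,
}


def make_optimizer(params):
    """Adam with the reported settings (PyTorch chosen only for illustration)."""
    return torch.optim.Adam(params, lr=HPARAMS["lr"],
                            betas=HPARAMS["betas"], eps=HPARAMS["eps"])


def soft_update(target_net, online_net, tau=HPARAMS["tau"]):
    """Polyak target update: θ_target ← τ·θ_online + (1 − τ)·θ_target."""
    with torch.no_grad():
        for t, o in zip(target_net.parameters(), online_net.parameters()):
            t.mul_(1.0 - tau).add_(tau * o)
```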