Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Authors: Qian Long*, Zihan Zhou*, Abhinav Gupta, Fei Fang, Yi Wu†, Xiaolong Wang†

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on three challenging environments, including a predator-prey-style Grassland game, a mixed-cooperative-and-competitive Adversarial Battle game and a fully cooperative Food Collection game. We compare EPC with multiple baseline methods on these environments with different scales of agent populations and show consistently large gains over the baselines.
Researcher Affiliation | Collaboration | Qian Long (CMU, qianlong@cs.cmu.edu); Zihan Zhou (SJTU, footoredo@sjtu.edu.cn); Abhinav Gupta (CMU, Facebook AI Research, abhinavg@cs.cmu.edu); Fei Fang (CMU, feif@cs.cmu.edu); Yi Wu (OpenAI, jxwuyi@openai.com); Xiaolong Wang (UCSD, xiw012@ucsd.edu)
Pseudocode | Yes | Algorithm 1: Evolutionary Population Curriculum (a high-level sketch of the loop follows this table).
Open Source Code | Yes | The source code and videos can be found at https://sites.google.com/view/epciclr2020/.
Open Datasets | Yes | All these environments are built on top of the particle-world environment (Mordatch & Abbeel, 2018) where agents take actions in discrete timesteps in a continuous 2D world. [...] Food Collection: This is exactly the same game as the Cooperative Navigation game in the MADDPG paper. (An environment-loading sketch follows this table.)
Dataset Splits | No | The paper trains agents in multi-agent reinforcement learning environments over episodes and stages with progressively increasing agent populations; it does not provide fixed training/validation/test splits as typically defined for static datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU model, CPU type, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and following hyperparameters from a previous work, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers.
Experiment Setup | Yes | We follow all the hyper-parameters in the original MADDPG paper (Lowe et al., 2017) for both EPC and all the baseline methods considered. Particularly, we use the Adam optimizer with learning rate 0.01, β1 = 0.9, β2 = 0.999 and ε = 10^-8 across all experiments. τ = 0.01 is set for target network update and γ = 0.95 is used as discount factor. We also use a replay buffer of size 10^6 and we update the network parameters after every 100 samples. The batch size is 1024. (A configuration sketch collecting these values follows this table.)
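
The Pseudocode row above refers to the paper's Algorithm 1. The following is a minimal Python sketch of that evolutionary curriculum loop, assuming hypothetical helper callables (`init_agents`, `train`, `mix_and_match`, `fitness`) that stand in for the paper's MADDPG training, mix-and-match crossover, and fitness-based selection; it illustrates the stage structure only and is not the authors' implementation.

```python
# Minimal sketch of the Evolutionary Population Curriculum loop (Algorithm 1).
# The four callables are hypothetical placeholders for the paper's components:
# agent initialization, MADDPG training, mix-and-match crossover, and
# fitness evaluation; only the stage structure is illustrated here.

def epc(init_agents, train, mix_and_match, fitness,
        n_agents=4, num_stages=3, num_parallel_sets=3, growth_factor=2):
    """Grow the agent population stage by stage, keeping only the fittest sets."""
    # Stage 0: train several parallel agent sets at the smallest scale.
    agent_sets = [train(init_agents(n_agents), n_agents)
                  for _ in range(num_parallel_sets)]

    for _ in range(num_stages):
        n_agents *= growth_factor  # curriculum: scale up the population

        # Crossover: combine smaller agent sets into candidate sets
        # for the larger population (the paper's mix-and-match step).
        candidates = mix_and_match(agent_sets, n_agents)

        # Mutation: fine-tune every candidate at the new population size,
        # then selection: keep the top-performing sets by fitness.
        candidates = [train(c, n_agents) for c in candidates]
        candidates.sort(key=fitness, reverse=True)
        agent_sets = candidates[:num_parallel_sets]

    return agent_sets[0]  # best policy set at the final scale


# Dummy stand-ins, only to show the call shape; real components would wrap
# attention-based MADDPG agents and environment rollouts.
best = epc(init_agents=lambda n: [f"agent_{i}" for i in range(n)],
           train=lambda agents, n: agents,
           mix_and_match=lambda sets, n: [a + b for a in sets for b in sets],
           fitness=len)
```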
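
The Open Datasets row notes that the three games are scenarios built on the particle-world environment. Below is a hedged sketch of loading and stepping such a scenario, following the conventions of OpenAI's multiagent-particle-envs package (its `make_env` helper and per-agent list API). The scenario name "simple_spread" (Cooperative Navigation, i.e. the Food Collection game) is illustrative; the released EPC code may register its Grassland and Adversarial Battle scenarios under different names or a modified interface.

```python
# Hedged sketch of loading and stepping a particle-world scenario, following
# the conventions of OpenAI's multiagent-particle-envs package. The scenario
# name "simple_spread" (Cooperative Navigation) stands in for Food Collection;
# the EPC release may expose Grassland / Adversarial Battle under other names.
import numpy as np
from make_env import make_env  # make_env.py sits at the repo root of
                               # multiagent-particle-envs (keep it on PYTHONPATH)

env = make_env("simple_spread")
obs_n = env.reset()                  # one observation vector per agent

for _ in range(25):                  # discrete timesteps in a continuous 2D world
    # Random one-hot movement actions stand in for learned policies;
    # MADDPG-style code feeds 5-dimensional action vectors per agent.
    act_n = [np.eye(5)[np.random.randint(5)] for _ in range(env.n)]
    obs_n, reward_n, done_n, info_n = env.step(act_n)
```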
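
The Experiment Setup row lists the training hyperparameters. A minimal sketch collecting them into an optimizer and target-update configuration is given below; the original MADDPG/EPC code is TensorFlow-based, so the PyTorch `Adam` call and the small stand-in MLPs here are assumptions used only to make the mapping concrete.

```python
# The hyperparameters reported in the paper, collected in one place. The
# PyTorch Adam call and the stand-in MLPs are assumptions for illustration;
# the original MADDPG/EPC code is TensorFlow-based.
import torch
import torch.nn as nn

GAMMA = 0.95          # discount factor
TAU = 0.01            # target-network soft-update rate
BUFFER_SIZE = 10**6   # replay buffer capacity
UPDATE_EVERY = 100    # gradient update after every 100 environment samples
BATCH_SIZE = 1024

# Stand-in actor and target networks (the paper uses attention-based agents;
# the layer sizes here are arbitrary placeholders).
actor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
target_actor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))

# Adam with learning rate 0.01, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8.
optimizer = torch.optim.Adam(actor.parameters(), lr=0.01,
                             betas=(0.9, 0.999), eps=1e-8)

def soft_update(target, source, tau=TAU):
    """Polyak-average target parameters toward the online network (tau = 0.01)."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)

soft_update(target_actor, actor)
```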