Emergent Complexity via Multi-Agent Competition

Authors: Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, Igor Mordatch

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environments themselves are relatively simple.
Researcher Affiliation | Collaboration | Trapit Bansal (UMass Amherst), Jakub Pachocki (OpenAI), Szymon Sidor (OpenAI), Ilya Sutskever (OpenAI), Igor Mordatch (OpenAI)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code for the environments, as well as learned policy parameters for agents in all the environments, is available at https://github.com/openai/multiagent-competition.
Open Datasets | No | The paper describes simulation environments in which data is generated through agent interaction rather than drawn from a pre-existing, publicly available dataset with concrete access information. While the environments are described, they are not presented as a downloadable dataset.
Dataset Splits | No | The paper does not specify fixed training/validation/test dataset splits. It describes generating data via parallel rollouts during training but does not partition a static dataset into these conventional splits.
Hardware Specification | No | The paper mentions running experiments 'on 4 GPUs' but does not provide further hardware details such as the GPU model, memory, or CPU.
Software Dependencies | No | The paper mentions several software components, including the MuJoCo framework (Todorov et al., 2012), Proximal Policy Optimization (PPO) (Schulman et al., 2017), Adam (Kingma & Ba, 2014), and the OpenAI Gym Humanoid-v1 environment, but it does not provide specific version numbers for any of them (a usage sketch follows the table).
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2014) with learning rate 0.001. The clipping parameter in PPO ϵ = 0.2, discounting factor γ = 0.995 and generalized advantage estimate parameter λ = 0.95. Each iteration, we collect 409600 samples from the parallel rollouts and perform multiple epochs of PPO training in mini-batches consisting of 5120 samples. For MLP policies we did 6 epochs of SGD per iteration and for LSTM policies we did 3 epochs. (A configuration sketch of these settings follows the table.)
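
As a point of reference for the unversioned dependency list above, here is a minimal sketch of how the named components typically fit together: OpenAI Gym exposes the Humanoid-v1 task, which is simulated by the MuJoCo engine, and PPO with Adam is then run on rollouts from such environments. No versions are pinned because the paper gives none; the snippet assumes the older (pre-0.26) Gym reset/step API and a MuJoCo-enabled Gym installation.

```python
import gym  # OpenAI Gym; requires a MuJoCo-enabled installation

# 'Humanoid-v1' is the environment ID the paper names; later Gym releases
# ship newer revisions (e.g. Humanoid-v2), so the exact ID depends on version.
env = gym.make("Humanoid-v1")

obs = env.reset()                           # pre-0.26 Gym API
for _ in range(10):
    action = env.action_space.sample()      # random policy as a stand-in
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```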
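
The optimization settings quoted in the Experiment Setup row can be collected into a small configuration object. This is only a sketch of the reported hyperparameters, assuming a PPO implementation with a clipped surrogate objective and GAE; the PPOConfig class and its field names are hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    learning_rate: float = 1e-3            # Adam (Kingma & Ba, 2014)
    clip_epsilon: float = 0.2              # PPO clipped-surrogate parameter
    gamma: float = 0.995                   # discount factor
    gae_lambda: float = 0.95               # generalized advantage estimation
    samples_per_iteration: int = 409_600   # collected from parallel rollouts
    minibatch_size: int = 5_120
    mlp_epochs: int = 6                    # PPO epochs per iteration, MLP policies
    lstm_epochs: int = 3                   # PPO epochs per iteration, LSTM policies

cfg = PPOConfig()
# 409600 samples / 5120 per mini-batch = 80 mini-batch updates per PPO epoch.
minibatches_per_epoch = cfg.samples_per_iteration // cfg.minibatch_size
assert minibatches_per_epoch == 80
```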