Emergent Complexity via Multi-Agent Competition
Authors: Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, Igor Mordatch
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environments themselves are relatively simple. |
| Researcher Affiliation | Collaboration | Trapit Bansal (UMass Amherst); Jakub Pachocki (OpenAI); Szymon Sidor (OpenAI); Ilya Sutskever (OpenAI); Igor Mordatch (OpenAI) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the environments, as well as learned policy parameters for agents in all environments, is available at https://github.com/openai/multiagent-competition. |
| Open Datasets | No | The paper describes a simulation environment where data is generated through agent interaction rather than using a pre-existing, publicly available dataset with concrete access information. While environments are described, they are not presented as a downloadable dataset. |
| Dataset Splits | No | The paper does not specify fixed training/validation/test dataset splits. It describes generating data via parallel rollouts during training but does not partition a static dataset into these conventional splits. |
| Hardware Specification | No | The paper mentions running experiments 'on 4 GPUs' but does not specify GPU models, memory, CPUs, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions several software components, such as the 'MuJoCo framework (Todorov et al., 2012)', 'Proximal Policy Optimization (PPO) (Schulman et al., 2017)', 'Adam (Kingma & Ba, 2014)', and the 'OpenAI Gym Humanoid-v1 environment', but it does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use Adam (Kingma & Ba, 2014) with learning rate 0.001. The PPO clipping parameter is ϵ = 0.2, the discount factor γ = 0.995, and the generalized advantage estimation parameter λ = 0.95. Each iteration, we collect 409600 samples from parallel rollouts and perform multiple epochs of PPO training in mini-batches of 5120 samples. For MLP policies we run 6 epochs of SGD per iteration; for LSTM policies we run 3 epochs. (See the configuration sketch below.) |
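
To make the reported setup concrete, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The snippet below is a minimal illustrative sketch, not the authors' training code: the names `ppo_config` and `minibatches_per_epoch` are assumptions introduced here, and only the numeric values come from the paper.

```python
# Hedged sketch of the PPO hyperparameters reported in the paper.
# All identifiers are illustrative; only the values are taken from the text.

ppo_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "clip_epsilon": 0.2,                # PPO clipping parameter ϵ
    "gamma": 0.995,                     # discount factor γ
    "gae_lambda": 0.95,                 # generalized advantage estimation λ
    "samples_per_iteration": 409_600,   # collected from parallel rollouts each iteration
    "minibatch_size": 5_120,
    "epochs_mlp": 6,                    # SGD epochs per iteration for MLP policies
    "epochs_lstm": 3,                   # SGD epochs per iteration for LSTM policies
}


def minibatches_per_epoch(cfg: dict) -> int:
    """Number of minibatch updates per PPO epoch implied by the reported sizes."""
    return cfg["samples_per_iteration"] // cfg["minibatch_size"]


if __name__ == "__main__":
    # 409600 samples split into minibatches of 5120 gives 80 updates per epoch.
    print(minibatches_per_epoch(ppo_config))
```

Under these reported sizes, each PPO epoch performs 80 minibatch updates (409600 / 5120), repeated 6 times per iteration for MLP policies and 3 times for LSTM policies.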