Emergent Tool Use From Multi-Agent Autocurricula

Authors: Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, Igor Mordatch

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests. (A minimal sketch of the hide-and-seek team reward appears after this table.)
Researcher Affiliation | Industry | Bowen Baker, OpenAI (bowen@openai.com); Ingmar Kanitscheider, OpenAI (ingmar@openai.com); Todor Markov, OpenAI (todor@openai.com); Yi Wu, OpenAI (jxwuyi@openai.com); Glenn Powell, OpenAI (glenn@openai.com); Bob McGrew, OpenAI (bmcgrew@openai.com); Igor Mordatch, Google Brain (imordatch@google.com)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. Figure 2 shows a policy architecture diagram, but it is not pseudocode or an algorithm block.
Open Source Code | Yes | The main contributions of this work are: ... 4) open-sourced environments and code for environment construction to encourage further research in physically grounded multi-agent autocurricula. (Footnote: Code can be found at github.com/openai/multi-agent-emergence-environments.)
Open Datasets | No | The paper describes a simulated environment in which agents train through self-play, generating their own experience. It does not use or provide a pre-existing, publicly available dataset for training in the conventional sense.
Dataset Splits | No | The paper does not provide train/validation/test dataset splits. It describes an environment and training process for reinforcement learning agents in which data is generated dynamically rather than partitioned from a static dataset.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned in the paper. It refers to a 'large-scale distributed RL framework' and a 'compute budget' but gives no explicit specifications.
Software Dependencies | No | The paper mentions software components such as the MuJoCo physics engine, Proximal Policy Optimization (PPO), Generalized Advantage Estimation (GAE), Adam, LSTMs, and layer normalization, along with citations. However, it does not provide version numbers for these dependencies, which would be necessary for full reproducibility. (Minimal sketches of the GAE and PPO computations appear after this table.)
Experiment Setup | Yes | Our optimization hyperparameter settings are as follows: buffer size 320,000; mini-batch size 64,000 chunks of 10 timesteps; learning rate 3 × 10⁻⁴; PPO clipping parameter ε 0.2; gradient clipping 5; entropy coefficient 0.01; γ 0.998; λ 0.95; max GAE horizon length T 160; BPTT truncation length 10.
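
As a concrete illustration of the "simple objective of hide-and-seek" quoted under Research Type, below is a minimal sketch of a team-based competitive reward. It assumes the visibility-based rule the paper describes (hiders rewarded when no hider is seen by a seeker, seekers receiving the opposite); the function name and signature are hypothetical and are not taken from the released code.

```python
from typing import List

def hide_and_seek_rewards(any_hider_seen: bool,
                          n_hiders: int,
                          n_seekers: int) -> List[float]:
    """Team-based competitive reward for one timestep.

    Assumption: hiders get +1 if no hider is visible to any seeker and
    -1 otherwise; seekers receive the opposite sign, so the game is
    zero-sum between the two teams.
    """
    hider_reward = -1.0 if any_hider_seen else 1.0
    seeker_reward = -hider_reward
    return [hider_reward] * n_hiders + [seeker_reward] * n_seekers

# Example: two hiders and two seekers, one hider spotted this step.
print(hide_and_seek_rewards(any_hider_seen=True, n_hiders=2, n_seekers=2))
# -> [-1.0, -1.0, 1.0, 1.0]
```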
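
The Software Dependencies row names PPO and Generalized Advantage Estimation (GAE) without version numbers. As a reference for what the GAE component computes, here is a minimal NumPy sketch using the γ = 0.998 and λ = 0.95 values from the Experiment Setup row. It is a single-trajectory simplification that ignores episode-boundary handling, not the authors' distributed implementation.

```python
import numpy as np

def gae_advantages(rewards: np.ndarray,
                   values: np.ndarray,
                   gamma: float = 0.998,
                   lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation over a single trajectory.

    `values` must contain one more entry than `rewards` (the bootstrap
    value for the final state). Default hyperparameters match the
    paper's reported settings.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error, then the exponentially weighted recursion.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example on a 160-step trajectory (the paper's max GAE horizon length).
rewards = np.random.randn(160)
values = np.random.randn(161)
print(gae_advantages(rewards, values).shape)  # (160,)
```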
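
Likewise, a minimal sketch of the standard PPO clipped surrogate loss using the clipping parameter ε = 0.2 and entropy coefficient 0.01 listed above; this is the generic formulation from the PPO paper, not the exact loss used in the authors' large-scale training framework.

```python
import numpy as np

def ppo_clipped_loss(log_probs_new: np.ndarray,
                     log_probs_old: np.ndarray,
                     advantages: np.ndarray,
                     entropy: np.ndarray,
                     clip_eps: float = 0.2,
                     entropy_coef: float = 0.01) -> float:
    """Standard PPO clipped surrogate objective (a loss to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - eps, 1 + eps] bounds how far one update can move the policy.
    The entropy bonus encourages exploration, weighted by the paper's
    reported coefficient of 0.01.
    """
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    return policy_loss - entropy_coef * np.mean(entropy)
```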