Emergent Coordination Through Competition

Authors: Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, Thore Graepel

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that decentralized, population-based training with co-play can lead to a progression in agents' behaviors: from random, to simple ball chasing, and finally showing evidence of cooperation. Our study highlights several of the challenges encountered in large-scale multi-agent training in continuous control. In particular, we demonstrate that the automatic optimization of simple shaping rewards, not themselves conducive to co-operative behavior, can lead to long-horizon team behavior. We further apply an evaluation scheme, grounded by game-theoretic principles, that can assess agent performance in the absence of pre-defined evaluation tasks or human baselines.
Researcher Affiliation | Industry | Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, Thore Graepel. DeepMind, London, United Kingdom. {liusiqi,guylever,jsmerel,stunya,heess,thore}@google.com
Pseudocode | Yes | Algorithm 1: Population-based Training for Multi-Agent RL. Algorithm 2: Off-policy SVG0 algorithm (Heess et al., 2015b). Algorithm 3: Iterative Elo rating update. Algorithm 4: Given agent i, select an agent j to evolve to. Algorithm 5: Agent i inherits from agent j by cross-over. (A hedged sketch of the iterative Elo update appears after this table.)
Open Source Code | Yes | The environment is released at https://git.io/dm_control_soccer.
Open Datasets | Yes | We simulate 2v2 soccer using the MuJoCo physics engine (Todorov et al., 2012). ... The environment is released at https://git.io/dm_control_soccer. (A hedged usage sketch of the released environment appears after this table.)
Dataset Splits | No | We train agents on a field whose dimensions are randomized in the range 20m × 15m to 28m × 21m, with fixed aspect ratio, and are tested on a field of fixed size 24m × 18m. The paper does not explicitly mention a separate 'validation' dataset split. (A sketch of this pitch-size randomization appears after this table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used, such as GPU/CPU models or cloud instance types.
Software Dependencies | No | The paper mentions software components such as MuJoCo (Todorov et al., 2012), the Adam optimizer (Kingma & Ba, 2014), and ELU activations (Clevert et al., 2015), but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | We use population-based training with 32 agents in the population; an agent is chosen for evolution if its expected win rate against another chosen agent drops below 0.47. The k-factor learning rate for Elo is 0.1... The maximum, minimum and mean of each dimension are then passed as input to the remainder of the network, where they are concatenated with the ball and pitch features. Both critic and actor then apply 2 feed-forward, ELU-activated, layers of size 512 and 256, followed by a final layer of 256 neurons which is either feed-forward or made recurrent using an LSTM... In our soccer experiments k = 40. ... periodically synced with the online action-value critic and policy (in our experiments we sync after every 100 gradient steps)... we apply a mutation probability of p_mutate = 0.1 and p_perturb = 0.2 for all experiments. (Hedged sketches of the observation-pooling torso, the evolution/mutation step, and the Elo update follow below.)
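
The released environment linked above ships as part of dm_control. The following is a minimal usage sketch, assuming it follows the standard dm_control locomotion.soccer API (a load() entry point and dm_env-style reset/step); the team_size and time_limit arguments shown here are illustrative assumptions rather than settings confirmed by the excerpt.

```python
import numpy as np
from dm_control.locomotion import soccer as dm_soccer

# Load a 2-vs-2 soccer environment (team_size and time_limit are assumptions).
env = dm_soccer.load(team_size=2, time_limit=10.0)

# One action spec per player; step() takes a list of per-player actions.
action_specs = env.action_spec()

timestep = env.reset()
while not timestep.last():
    actions = [
        np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
        for spec in action_specs
    ]
    timestep = env.step(actions)
```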
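
The pitch randomization described under Dataset Splits can be expressed in a few lines. A minimal sketch, assuming the width is drawn uniformly and the height follows from the fixed 4:3 aspect ratio; the sampling distribution itself is not stated in the excerpt.

```python
import random

def sample_training_pitch(min_width=20.0, max_width=28.0, aspect=4.0 / 3.0):
    """Sample a training pitch between 20m x 15m and 28m x 21m with a fixed
    4:3 aspect ratio; evaluation uses a fixed 24m x 18m pitch."""
    width = random.uniform(min_width, max_width)  # uniform sampling is an assumption
    return width, width / aspect
```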
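
The observation processing described in the Experiment Setup row (per-dimension max/min/mean pooling over teammate and opponent features, concatenation with ball and pitch features, two ELU layers of 512 and 256 units, and a final 256-unit layer) can be sketched as below. This is a hedged reconstruction, not the authors' code: the per-player and ball/pitch feature dimensions are illustrative assumptions, and the recurrent (LSTM) variant of the final layer is omitted.

```python
import torch
import torch.nn as nn

class PooledTorso(nn.Module):
    """Actor/critic torso sketch: pool teammate and opponent features with
    per-dimension max/min/mean, concatenate with ball and pitch features,
    then apply ELU layers of 512 and 256 units and a final 256-unit layer."""

    def __init__(self, player_feat_dim=16, ball_pitch_dim=12):  # dims are assumptions
        super().__init__()
        in_dim = 2 * 3 * player_feat_dim + ball_pitch_dim  # teammates + opponents, 3 stats each
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, 256),  # the paper also reports an LSTM variant of this layer
        )

    @staticmethod
    def _pool(players):
        # players: (batch, num_players, feat_dim) -> (batch, 3 * feat_dim)
        return torch.cat(
            [players.max(dim=1).values, players.min(dim=1).values, players.mean(dim=1)],
            dim=-1,
        )

    def forward(self, teammates, opponents, ball_pitch):
        pooled = torch.cat([self._pool(teammates), self._pool(opponents)], dim=-1)
        return self.net(torch.cat([pooled, ball_pitch], dim=-1))
```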
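
The population-based training numbers quoted above (evolution triggered when the expected win rate drops below 0.47, p_mutate = 0.1, p_perturb = 0.2) suggest the following sketch of the eligibility check and the cross-over/mutation step. Treating p_perturb as a relative perturbation scale is an assumption; the excerpt does not define it.

```python
import random

EVOLVE_THRESHOLD = 0.47
P_MUTATE = 0.1
P_PERTURB = 0.2

def should_evolve(expected_win_rate):
    """Agent i becomes eligible to inherit from agent j when its expected
    win rate against j drops below the threshold."""
    return expected_win_rate < EVOLVE_THRESHOLD

def crossover_and_mutate(parent_hparams, child_hparams, rng=random):
    """Cross-over followed by mutation: each hyperparameter is taken from
    either parent, then perturbed with probability P_MUTATE by a relative
    factor of +/- P_PERTURB (interpreting p_perturb as a scale is an assumption)."""
    new_hparams = {}
    for name, child_value in child_hparams.items():
        value = rng.choice([child_value, parent_hparams[name]])  # cross-over
        if rng.random() < P_MUTATE:
            value *= 1.0 + rng.uniform(-P_PERTURB, P_PERTURB)    # mutation
        new_hparams[name] = value
    return new_hparams
```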
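
Algorithm 3 (iterative Elo rating update) is listed in the pseudocode row, and the Experiment Setup row gives its k-factor learning rate as 0.1. A minimal sketch of a standard iterative Elo step follows; the conventional 400/10 logistic scaling is an assumption, since the excerpt does not specify the paper's exact parameterization.

```python
def elo_expected_score(rating_i, rating_j):
    """Predicted probability that agent i beats agent j (standard Elo model)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_j - rating_i) / 400.0))

def elo_update(rating_i, rating_j, outcome_i, k=0.1):
    """One iterative Elo update.

    outcome_i: 1.0 if agent i won, 0.5 for a draw, 0.0 if it lost.
    k: k-factor learning rate (0.1 is the value reported in the excerpt).
    """
    delta = k * (outcome_i - elo_expected_score(rating_i, rating_j))
    return rating_i + delta, rating_j - delta
```

For equally rated agents, elo_update(1000.0, 1000.0, 1.0) moves the winner up by k/2 and the loser down by k/2.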