Emergent Coordination Through Competition

Authors: Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, Thore Graepel

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that decentralized, population-based training with co-play can lead to a progression in agents' behaviors: from random, to simple ball chasing, and finally showing evidence of cooperation. Our study highlights several of the challenges encountered in large-scale multi-agent training in continuous control. In particular, we demonstrate that the automatic optimization of simple shaping rewards, not themselves conducive to co-operative behavior, can lead to long-horizon team behavior. We further apply an evaluation scheme, grounded by game-theoretic principles, that can assess agent performance in the absence of pre-defined evaluation tasks or human baselines.
Researcher Affiliation | Industry | Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, Thore Graepel. DeepMind, London, United Kingdom. {liusiqi,guylever,jsmerel,stunya,heess,thore}@google.com
Pseudocode | Yes | Algorithm 1: Population-based Training for Multi-Agent RL. Algorithm 2: Off-policy SVG0 algorithm (Heess et al., 2015b). Algorithm 3: Iterative Elo rating update. Algorithm 4: Given agent i, select an agent j to evolve to. Algorithm 5: Agent i inherits from agent j by cross-over. (A hedged sketch of the iterative Elo update appears after this table.)
Open Source Code | Yes | The environment is released at https://git.io/dm_control_soccer.
Open Datasets | Yes | We simulate 2v2 soccer using the MuJoCo physics engine (Todorov et al., 2012). ... The environment is released at https://git.io/dm_control_soccer. (A hedged usage sketch of the released environment appears after this table.)
Dataset Splits | No | We train agents on a field whose dimensions are randomized in the range 20m × 15m to 28m × 21m, with fixed aspect ratio, and are tested on a field of fixed size 24m × 18m. The paper does not explicitly mention a separate 'validation' dataset split. (A sketch of this pitch-size randomization appears after this table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used, such as GPU/CPU models or cloud instance types.
Software Dependencies | No | The paper mentions software components such as MuJoCo (Todorov et al., 2012), the Adam optimizer (Kingma & Ba, 2014), and ELU activations (Clevert et al., 2015), but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | We use population-based training with 32 agents in the population; an agent is chosen for evolution if its expected win rate against another chosen agent drops below 0.47. The k-factor learning rate for Elo is 0.1... The maximum, minimum and mean of each dimension are then passed as input to the remainder of the network, where they are concatenated with the ball and pitch features. Both critic and actor then apply 2 feed-forward, ELU-activated, layers of size 512 and 256, followed by a final layer of 256 neurons which is either feed-forward or made recurrent using an LSTM... In our soccer experiments k = 40. ... periodically synced with the online action-value critic and policy (in our experiments we sync after every 100 gradient steps)... we apply a mutation probability of p_mutate = 0.1 and p_perturb = 0.2 for all experiments. (Hedged sketches of the observation-pooling torso, the evolution/mutation step, and the Elo update follow below.)
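
The released environment linked above ships as part of dm_control. The following is a minimal usage sketch, assuming it follows the standard dm_control locomotion.soccer API (a load() entry point and dm_env-style reset/step); the team_size and time_limit arguments shown here are illustrative assumptions rather than settings confirmed by the excerpt.

```python
import numpy as np
from dm_control.locomotion import soccer as dm_soccer

# Load a 2-vs-2 soccer environment (team_size and time_limit are assumptions).
env = dm_soccer.load(team_size=2, time_limit=10.0)

# One action spec per player; step() takes a list of per-player actions.
action_specs = env.action_spec()

timestep = env.reset()
while not timestep.last():
    actions = [
        np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
        for spec in action_specs
    ]
    timestep = env.step(actions)
```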
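
The pitch randomization described under Dataset Splits can be expressed in a few lines. A minimal sketch, assuming the width is drawn uniformly and the height follows from the fixed 4:3 aspect ratio; the sampling distribution itself is not stated in the excerpt.

```python
import random

def sample_training_pitch(min_width=20.0, max_width=28.0, aspect=4.0 / 3.0):
    """Sample a training pitch between 20m x 15m and 28m x 21m with a fixed
    4:3 aspect ratio; evaluation uses a fixed 24m x 18m pitch."""
    width = random.uniform(min_width, max_width)  # uniform sampling is an assumption
    return width, width / aspect
```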
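
The observation processing described in the Experiment Setup row (per-dimension max/min/mean pooling over teammate and opponent features, concatenation with ball and pitch features, two ELU layers of 512 and 256 units, and a final 256-unit layer) can be sketched as below. This is a hedged reconstruction, not the authors' code: the per-player and ball/pitch feature dimensions are illustrative assumptions, and the recurrent (LSTM) variant of the final layer is omitted.

```python
import torch
import torch.nn as nn

class PooledTorso(nn.Module):
    """Actor/critic torso sketch: pool teammate and opponent features with
    per-dimension max/min/mean, concatenate with ball and pitch features,
    then apply ELU layers of 512 and 256 units and a final 256-unit layer."""

    def __init__(self, player_feat_dim=16, ball_pitch_dim=12):  # dims are assumptions
        super().__init__()
        in_dim = 2 * 3 * player_feat_dim + ball_pitch_dim  # teammates + opponents, 3 stats each
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, 256),  # the paper also reports an LSTM variant of this layer
        )

    @staticmethod
    def _pool(players):
        # players: (batch, num_players, feat_dim) -> (batch, 3 * feat_dim)
        return torch.cat(
            [players.max(dim=1).values, players.min(dim=1).values, players.mean(dim=1)],
            dim=-1,
        )

    def forward(self, teammates, opponents, ball_pitch):
        pooled = torch.cat([self._pool(teammates), self._pool(opponents)], dim=-1)
        return self.net(torch.cat([pooled, ball_pitch], dim=-1))
```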
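
The population-based training numbers quoted above (evolution triggered when the expected win rate drops below 0.47, p_mutate = 0.1, p_perturb = 0.2) suggest the following sketch of the eligibility check and the cross-over/mutation step. Treating p_perturb as a relative perturbation scale is an assumption; the excerpt does not define it.

```python
import random

EVOLVE_THRESHOLD = 0.47
P_MUTATE = 0.1
P_PERTURB = 0.2

def should_evolve(expected_win_rate):
    """Agent i becomes eligible to inherit from agent j when its expected
    win rate against j drops below the threshold."""
    return expected_win_rate < EVOLVE_THRESHOLD

def crossover_and_mutate(parent_hparams, child_hparams, rng=random):
    """Cross-over followed by mutation: each hyperparameter is taken from
    either parent, then perturbed with probability P_MUTATE by a relative
    factor of +/- P_PERTURB (interpreting p_perturb as a scale is an assumption)."""
    new_hparams = {}
    for name, child_value in child_hparams.items():
        value = rng.choice([child_value, parent_hparams[name]])  # cross-over
        if rng.random() < P_MUTATE:
            value *= 1.0 + rng.uniform(-P_PERTURB, P_PERTURB)    # mutation
        new_hparams[name] = value
    return new_hparams
```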
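
Algorithm 3 (iterative Elo rating update) is listed in the pseudocode row, and the Experiment Setup row gives its k-factor learning rate as 0.1. A minimal sketch of a standard iterative Elo step follows; the conventional 400/10 logistic scaling is an assumption, since the excerpt does not specify the paper's exact parameterization.

```python
def elo_expected_score(rating_i, rating_j):
    """Predicted probability that agent i beats agent j (standard Elo model)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_j - rating_i) / 400.0))

def elo_update(rating_i, rating_j, outcome_i, k=0.1):
    """One iterative Elo update.

    outcome_i: 1.0 if agent i won, 0.5 for a draw, 0.0 if it lost.
    k: k-factor learning rate (0.1 is the value reported in the excerpt).
    """
    delta = k * (outcome_i - elo_expected_score(rating_i, rating_j))
    return rating_i + delta, rating_j - delta
```

For equally rated agents, elo_update(1000.0, 1000.0, 1.0) moves the winner up by k/2 and the loser down by k/2.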