NeuPL: Neural Population Learning

Authors: Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, Thore Graepel

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains.
Researcher Affiliation | Collaboration | Siqi Liu (University College London, DeepMind) liusiqi@google.com; Luke Marris (University College London, DeepMind) marris@google.com; Daniel Hennes (DeepMind) hennes@google.com; Josh Merel (DeepMind) jsmerel@gmail.com; Nicolas Heess (DeepMind) heess@google.com; Thore Graepel (University College London) t.graepel@ucl.ac.uk
Pseudocode | Yes | Algorithm 1: Neural Population Learning (Ours); Algorithm 2: PSRO (Lanctot et al., 2017); Algorithm 3: A meta-graph solver implementing PSRO-Nash; Algorithm 4: Neural Population Learning by RL (static F); Algorithm 5: Neural Population Learning by RL (adaptive F). A hedged sketch of the population-learning loop is given after the table.
Open Source Code | No | Footnote 1 states: 'See https://neupl.github.io/demo/ for supplementary illustrations.' This link provides supplementary illustrations rather than the source code of the method described in the paper, and no other explicit statement about a code release is found.
Open Datasets | Yes | Empirically, we illustrate the generality of NeuPL by replicating known results of population learning algorithms on the classical domain of rock-paper-scissors as well as its partially-observed, spatiotemporal counterpart running-with-scissors (Vezhnevets et al., 2020). ... scales to the large-scale Game-of-Skills of MuJoCo Football (Liu et al., 2019)
Dataset Splits | Yes | For all experiments using NeuPL, an evaluation split ϵ = 0.3 is used.
Hardware Specification | Yes | In running-with-scissors, each NeuPL experiment uses 128 actor workers running the policy-environment interaction loops and a single TPU-v2 chip running gradient updates to the agent networks. ... For MuJoCo Football, 256 CPU actors are used per learner. For the game of rock-paper-scissors, a single CPU worker is used instead.
Software Dependencies | No | The paper mentions Maximum a Posteriori Policy Optimization (MPO, Abdolmaleki et al. (2018)) as the underlying RL algorithm and the types of neural networks used (LSTM, MLP), but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | We use a small entropy cost of 0.01, learning rates of 0.001 and 0.01 for the main networks and the MPO dual variables (Abdolmaleki et al., 2018) respectively. ... The learning rate of the agent networks is set to 0.0001 while the MPO dual variables are optimized with a learning rate of 0.001. The online network parameters are copied to target networks every 100 gradient steps. These values are collected into a configuration sketch after the table.
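
To make the pseudocode row concrete, below is a minimal sketch (in Python) of a NeuPL-style population-learning loop. The helper callables (payoff estimation, the RL update, and the meta-graph solver) are hypothetical placeholders standing in for the paper's components, not the authors' implementation.

    import numpy as np

    def neural_population_learning(estimate_payoffs, rl_update, meta_solver,
                                   n_policies, n_iterations):
        # A single shared conditional network represents the whole population;
        # here the network is left abstract and is updated in place by `rl_update`
        # (hypothetical helper). sigma[i] is policy i's mixture over opponents.
        sigma = np.eye(n_policies)
        for _ in range(n_iterations):
            # Estimate the empirical payoff matrix between population members.
            payoffs = estimate_payoffs(sigma)
            # The meta-graph solver F (e.g. a PSRO-Nash-style solver) maps the
            # payoffs to new opponent mixtures for every policy.
            sigma = meta_solver(payoffs)
            # Train every policy concurrently against opponents sampled from
            # its own mixture.
            for i in range(n_policies):
                opponent = np.random.choice(n_policies, p=sigma[i])
                rl_update(learner_id=i, opponent_id=opponent, mixture=sigma[i])
        return sigma

Loosely, a static-F variant (Algorithm 4) would keep the mixtures fixed across iterations, while an adaptive-F variant (Algorithm 5) recomputes them from the empirical payoffs as in the loop above.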
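The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The quoted text gives two sets of values without naming the experiments they belong to, so the grouping into setup A and setup B below is an illustrative assumption, not the authors' configuration format.

    # Values taken verbatim from the quoted experiment setup; the dict layout
    # and the A/B grouping are illustrative assumptions.
    SETUP_A = {
        "entropy_cost": 0.01,
        "lr_main_networks": 1e-3,
        "lr_mpo_dual_variables": 1e-2,
    }
    SETUP_B = {
        "lr_agent_networks": 1e-4,
        "lr_mpo_dual_variables": 1e-3,
        "target_network_copy_period": 100,  # gradient steps between online -> target copies
    }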