NeuPL: Neural Population Learning

Authors: Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, Thore Graepel

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains.
Researcher Affiliation | Collaboration | Siqi Liu (University College London, DeepMind) liusiqi@google.com; Luke Marris (University College London, DeepMind) marris@google.com; Daniel Hennes (DeepMind) hennes@google.com; Josh Merel (DeepMind) jsmerel@gmail.com; Nicolas Heess (DeepMind) heess@google.com; Thore Graepel (University College London) t.graepel@ucl.ac.uk
Pseudocode | Yes | Algorithm 1: Neural Population Learning (Ours); Algorithm 2: PSRO (Lanctot et al., 2017); Algorithm 3: A meta-graph solver implementing PSRO-Nash; Algorithm 4: Neural Population Learning by RL (static F); Algorithm 5: Neural Population Learning by RL (adaptive F). A hedged sketch of the population-learning loop is given after the table.
Open Source Code | No | Footnote 1 states: 'See https://neupl.github.io/demo/ for supplementary illustrations.' This link provides supplementary illustrations rather than the source code of the method described in the paper, and no other explicit statement about a code release is found.
Open Datasets | Yes | Empirically, we illustrate the generality of NeuPL by replicating known results of population learning algorithms on the classical domain of rock-paper-scissors as well as its partially-observed, spatiotemporal counterpart running-with-scissors (Vezhnevets et al., 2020). ... scales to the large-scale Game-of-Skills of MuJoCo Football (Liu et al., 2019)
Dataset Splits | Yes | For all experiments using NeuPL, an evaluation split ϵ = 0.3 is used.
Hardware Specification | Yes | In running-with-scissors, each NeuPL experiment uses 128 actor workers running the policy-environment interaction loops and a single TPU-v2 chip running gradient updates to the agent networks. ... For MuJoCo Football, 256 CPU actors are used per learner. For the game of rock-paper-scissors, a single CPU worker is used instead.
Software Dependencies | No | The paper mentions Maximum a Posteriori Policy Optimization (MPO, Abdolmaleki et al. (2018)) as the underlying RL algorithm and the types of neural networks used (LSTM, MLP), but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | We use a small entropy cost of 0.01, learning rates of 0.001 and 0.01 for the main networks and the MPO dual variables (Abdolmaleki et al., 2018) respectively. ... The learning rate of the agent networks is set to 0.0001 while the MPO dual variables are optimized with a learning rate of 0.001. The online network parameters are copied to target networks every 100 gradient steps. These values are collected into a configuration sketch after the table.
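
To make the pseudocode row concrete, below is a minimal sketch (in Python) of a NeuPL-style population-learning loop. The helper callables (payoff estimation, the RL update, and the meta-graph solver) are hypothetical placeholders standing in for the paper's components, not the authors' implementation.

    import numpy as np

    def neural_population_learning(estimate_payoffs, rl_update, meta_solver,
                                   n_policies, n_iterations):
        # A single shared conditional network represents the whole population;
        # here the network is left abstract and is updated in place by `rl_update`
        # (hypothetical helper). sigma[i] is policy i's mixture over opponents.
        sigma = np.eye(n_policies)
        for _ in range(n_iterations):
            # Estimate the empirical payoff matrix between population members.
            payoffs = estimate_payoffs(sigma)
            # The meta-graph solver F (e.g. a PSRO-Nash-style solver) maps the
            # payoffs to new opponent mixtures for every policy.
            sigma = meta_solver(payoffs)
            # Train every policy concurrently against opponents sampled from
            # its own mixture.
            for i in range(n_policies):
                opponent = np.random.choice(n_policies, p=sigma[i])
                rl_update(learner_id=i, opponent_id=opponent, mixture=sigma[i])
        return sigma

Loosely, a static-F variant (Algorithm 4) would keep the mixtures fixed across iterations, while an adaptive-F variant (Algorithm 5) recomputes them from the empirical payoffs as in the loop above.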
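The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The quoted text gives two sets of values without naming the experiments they belong to, so the grouping into setup A and setup B below is an illustrative assumption, not the authors' configuration format.

    # Values taken verbatim from the quoted experiment setup; the dict layout
    # and the A/B grouping are illustrative assumptions.
    SETUP_A = {
        "entropy_cost": 0.01,
        "lr_main_networks": 1e-3,
        "lr_mpo_dual_variables": 1e-2,
    }
    SETUP_B = {
        "lr_agent_networks": 1e-4,
        "lr_mpo_dual_variables": 1e-3,
        "target_network_copy_period": 100,  # gradient steps between online -> target copies
    }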