Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Authors: Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, Yoram Bachrach
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state of the art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements. We analyze our agents through multiple lenses: we measure win rates (1) head-to-head between final agents from different algorithms and (2) against fixed populations of reference agents; (3) we consider meta-games between checkpoints of one training run to test for consistent improvement; and (4) we examine the exploitability of agents from different algorithms. |
| Researcher Affiliation | Industry | The authors are employees of DeepMind. |
| Pseudocode | Yes | Algorithm 1 Sampled Best Response; Algorithm 2 Best Response Policy Iteration (hedged sketches of both follow the table). |
| Open Source Code | Yes | We will open-source these BRPI agents and our SL agent for benchmarking. |
| Open Datasets | Yes | Paquette et al. [90] achieved a major breakthrough: they collected a dataset of 150,000 human Diplomacy games, and trained an agent, DipNet, using a graph neural network (GNN) to imitate the moves in this dataset. We thank Kestas Kuliukas for providing the dataset of human Diplomacy games. |
| Dataset Splits | Yes | These changes increase prediction accuracy by 4-5% on our validation set (data splits and performance comparison in Appendix C). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the use of GNN, LSTM, A2C, DQN, and Actor-Critic but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | At test time, we run all networks at a softmax temperature t = 0.1 (a minimal temperature-sampling sketch follows the table). Our network is based on the imitation learning of DipNet [90], which uses an encoder GNN to embed each province and an LSTM decoder to output unit moves (see the DipNet paper for details). We make several improvements, described briefly here and fully in Appendix C. |
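The Sampled Best Response procedure named in the Pseudocode row amounts to: sample candidate actions from a base policy, score each candidate against sampled joint opponent actions with a value function, and play the best-scoring candidate. Below is a minimal Python sketch under those assumptions; `base_policy.sample`, `value_fn`, and `opponent_policies` are hypothetical interfaces for illustration, not the authors' released implementation.

```python
import numpy as np

def sampled_best_response(state, base_policy, value_fn, opponent_policies,
                          num_candidates=16, num_opponent_samples=8):
    """Sketch of Sampled Best Response (Algorithm 1), assuming the
    interfaces named in the lead-in paragraph."""
    # Candidate actions are drawn from the base policy.
    candidates = [base_policy.sample(state) for _ in range(num_candidates)]
    best_action, best_value = None, -np.inf
    for action in candidates:
        # Monte Carlo estimate of this candidate's value against
        # jointly sampled opponent actions.
        total = 0.0
        for _ in range(num_opponent_samples):
            joint_opponent_actions = [p.sample(state) for p in opponent_policies]
            total += value_fn(state, action, joint_opponent_actions)
        mean_value = total / num_opponent_samples
        if mean_value > best_value:
            best_action, best_value = action, mean_value
    return best_action
```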
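Best Response Policy Iteration (Algorithm 2) then wraps an improvement step (such as SBR) in an outer loop that distills the improved play back into a network. The outline below is a sketch of that loop shape only; `improve` and `distill` are assumed callables, not the paper's exact procedure.

```python
def best_response_policy_iteration(initial_policy, num_iters, improve, distill):
    """Sketch of BRPI (Algorithm 2): alternate policy improvement and
    distillation, keeping a history of checkpoints."""
    policy = initial_policy
    history = [policy]
    for _ in range(num_iters):
        improved = improve(policy, history)  # e.g. SBR against past checkpoints
        policy = distill(improved)           # supervised learning on improved play
        history.append(policy)
    return policy
```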
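The Experiment Setup row states that all networks are run at a softmax temperature of t = 0.1 at test time. A minimal sketch of temperature-scaled sampling, assuming raw network logits as input (the function name and interface are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.1, rng=None):
    """Sample an action index from raw logits at a softmax temperature.

    t = 0.1 (the paper's test-time setting) sharpens the distribution
    toward the highest-logit actions; t = 1.0 recovers the plain softmax.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```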