Learning to Play No-Press Diplomacy with Best Response Policy Iteration

Authors: Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, Yoram Bachrach

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements. We analyze our agents through multiple lenses: we measure win rates (1) head-to-head between final agents from different algorithms and (2) against fixed populations of reference agents; (3) we consider meta-games between checkpoints of one training run to test for consistent improvement; and (4) we examine the exploitability of agents from different algorithms.
Researcher Affiliation | Industry | Authors are employees of DeepMind.
Pseudocode | Yes | Algorithm 1 (Sampled Best Response); Algorithm 2 (Best Response Policy Iteration). An illustrative sketch of Sampled Best Response appears after the table.
Open Source Code | Yes | We will open-source these BRPI agents and our SL agent for benchmarking.
Open Datasets | Yes | Paquette et al. [90] achieved a major breakthrough: they collected a dataset of 150,000 human Diplomacy games and trained an agent, DipNet, using a graph neural network (GNN) to imitate the moves in this dataset. We thank Kestas Kuliukas for providing the dataset of human Diplomacy games.
Dataset Splits | Yes | These changes increase prediction accuracy by 4-5% on our validation set (data splits and performance comparison in Appendix C).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions the use of GNN, LSTM, A2C, DQN, and actor-critic methods but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | At test time, we run all networks at a softmax temperature t = 0.1. Our network is based on the DipNet imitation-learning agent [90], which uses an encoder GNN to embed each province and an LSTM decoder to output unit moves (see the DipNet paper for details). We make several improvements, described briefly here and fully in Appendix C. A sketch of temperature-scaled sampling appears after the table.
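
The paper's Algorithm 1 (Sampled Best Response) approximates a one-step best response by sampling candidate actions for one player and scoring them against opponent actions sampled from a base policy. The sketch below is a minimal illustration of that idea only: the interfaces base_policy, candidate_policy, value_fn, state.players, and state.apply are hypothetical stand-ins, and the sample counts are arbitrary; the paper's Algorithm 1 and appendices define the actual procedure.

```python
import numpy as np

def sampled_best_response(state, player, base_policy, candidate_policy,
                          value_fn, num_candidates=16, num_base_samples=32,
                          rng=None):
    """Illustrative one-step Sampled Best Response for a single player.

    Assumed (hypothetical) interfaces:
      base_policy(state, p)       -> sampled action for player p from the base profile
      candidate_policy(state, p)  -> sampled candidate action for player p
      value_fn(next_state, p)     -> scalar value estimate for player p
      state.players()             -> iterable of player ids
      state.apply(joint_action)   -> successor state for a dict of per-player actions
    """
    rng = rng or np.random.default_rng()
    opponents = [p for p in state.players() if p != player]

    # Sample joint opponent action profiles from the base policy.
    base_profiles = [{p: base_policy(state, p) for p in opponents}
                     for _ in range(num_base_samples)]

    # Score each sampled candidate action by its mean estimated value
    # against the sampled opponent profiles; keep the best.
    best_action, best_value = None, -np.inf
    for _ in range(num_candidates):
        action = candidate_policy(state, player)
        values = []
        for profile in base_profiles:
            joint = dict(profile)
            joint[player] = action
            values.append(value_fn(state.apply(joint), player))
        mean_value = float(np.mean(values))
        if mean_value > best_value:
            best_action, best_value = action, mean_value
    return best_action
```

Best Response Policy Iteration (Algorithm 2) then alternates between computing such approximate best responses and fitting a new policy (and value) network to the actions they select, so each iteration imitates an improved version of the previous policy.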
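
The experiment-setup excerpt states that all networks are run at a softmax temperature of t = 0.1 at test time. The helper below is a generic sketch of temperature-scaled sampling from action logits, not code from the paper; the function name and interface are assumptions.

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.1, rng=None):
    """Sample an action index from logits at a given softmax temperature.

    At the paper's test-time temperature of 0.1 the distribution is sharply
    peaked, so sampling behaves close to (but not exactly like) greedy argmax.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```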