Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Authors: Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, Yoram Bachrach
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state of the art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements. We analyze our agents through multiple lenses: we measure win rates (1) head-to-head between final agents from different algorithms and (2) against fixed populations of reference agents; (3) we consider meta-games between checkpoints of one training run to test for consistent improvement; and (4) we examine the exploitability of agents from different algorithms. |
| Researcher Affiliation | Industry | The authors are employees of DeepMind. |
| Pseudocode | Yes | Algorithm 1 Sampled Best Response; Algorithm 2 Best Response Policy Iteration (hedged sketches of both follow the table). |
| Open Source Code | Yes | We will open-source these BRPI agents and our SL agent for benchmarking. |
| Open Datasets | Yes | Paquette et al. [90] achieved a major breakthrough: they collected a dataset of 150,000 human Diplomacy games, and trained an agent, DipNet, using a graph neural network (GNN) to imitate the moves in this dataset. We thank Kestas Kuliukas for providing the dataset of human Diplomacy games. |
| Dataset Splits | Yes | These changes increase prediction accuracy by 4-5% on our validation set (data splits and performance comparison in Appendix C). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the use of GNN, LSTM, A2C, DQN, and Actor-Critic but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | At test time, we run all networks at a softmax temperature t = 0.1 (a minimal temperature-sampling sketch follows the table). Our network is based on the imitation learning of DipNet [90], which uses an encoder GNN to embed each province and an LSTM decoder to output unit moves (see the DipNet paper for details). We make several improvements, described briefly here and fully in Appendix C. |
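The Sampled Best Response procedure named in the Pseudocode row amounts to: sample candidate actions from a base policy, score each candidate against sampled joint opponent actions with a value function, and play the best-scoring candidate. Below is a minimal Python sketch under those assumptions; `base_policy.sample`, `value_fn`, and `opponent_policies` are hypothetical interfaces for illustration, not the authors' released implementation.

```python
import numpy as np

def sampled_best_response(state, base_policy, value_fn, opponent_policies,
                          num_candidates=16, num_opponent_samples=8):
    """Sketch of Sampled Best Response (Algorithm 1), assuming the
    interfaces named in the lead-in paragraph."""
    # Candidate actions are drawn from the base policy.
    candidates = [base_policy.sample(state) for _ in range(num_candidates)]
    best_action, best_value = None, -np.inf
    for action in candidates:
        # Monte Carlo estimate of this candidate's value against
        # jointly sampled opponent actions.
        total = 0.0
        for _ in range(num_opponent_samples):
            joint_opponent_actions = [p.sample(state) for p in opponent_policies]
            total += value_fn(state, action, joint_opponent_actions)
        mean_value = total / num_opponent_samples
        if mean_value > best_value:
            best_action, best_value = action, mean_value
    return best_action
```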
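Best Response Policy Iteration (Algorithm 2) then wraps an improvement step (such as SBR) in an outer loop that distills the improved play back into a network. The outline below is a sketch of that loop shape only; `improve` and `distill` are assumed callables, not the paper's exact procedure.

```python
def best_response_policy_iteration(initial_policy, num_iters, improve, distill):
    """Sketch of BRPI (Algorithm 2): alternate policy improvement and
    distillation, keeping a history of checkpoints."""
    policy = initial_policy
    history = [policy]
    for _ in range(num_iters):
        improved = improve(policy, history)  # e.g. SBR against past checkpoints
        policy = distill(improved)           # supervised learning on improved play
        history.append(policy)
    return policy
```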
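The Experiment Setup row states that all networks are run at a softmax temperature of t = 0.1 at test time. A minimal sketch of temperature-scaled sampling, assuming raw network logits as input (the function name and interface are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.1, rng=None):
    """Sample an action index from raw logits at a softmax temperature.

    t = 0.1 (the paper's test-time setting) sharpens the distribution
    toward the highest-logit actions; t = 1.0 recovers the plain softmax.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```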