Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Authors: Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas Hudson, Nicolas Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Satinder Singh, Thore Graepel, Yoram Bachrach
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements. We analyze our agents through multiple lenses: We measure winrates (1) head-to-head between final agents from different algorithms and (2) against fixed populations of reference agents. (3) We consider meta-games between checkpoints of one training run to test for consistent improvement. (4) We examine the exploitability of agents from different algorithms. |
| Researcher Affiliation | Industry | Authors are employees of DeepMind. |
| Pseudocode | Yes | Algorithm 1 Sampled Best Response; Algorithm 2 Best Response Policy Iteration |
| Open Source Code | Yes | We will open-source these BRPI agents and our SL agent for benchmarking. |
| Open Datasets | Yes | Paquette et al. [90] achieved a major breakthrough: they collected a dataset of 150,000 human Diplomacy games, and trained an agent, DipNet, using a graph neural network (GNN) to imitate the moves in this dataset. We thank Kestas Kuliukas for providing the dataset of human diplomacy games. |
| Dataset Splits | Yes | These changes increase prediction accuracy by 4-5% on our validation set (data splits and performance comparison in Appendix C). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the use of GNN, LSTM, A2C, DQN, and Actor-Critic but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | At test time, we run all networks at a softmax temperature t = 0.1. Our network is based on the imitation learning of DipNet [90], which uses an encoder GNN to embed each province, and a LSTM decoder to output unit moves (see DipNet paper for details). We make several improvements, described briefly here, and fully in Appendix C. |
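
The Experiment Setup entry quotes a test-time softmax temperature of t = 0.1. As a rough illustration of what that setting does, the sketch below applies a temperature-scaled softmax to a small list of logits. It is not the paper's code: the function name `softmax_with_temperature` and the example logits are hypothetical, and it assumes the common convention of dividing logits by the temperature before normalizing.

```python
import math


def softmax_with_temperature(logits, temperature=0.1):
    """Return softmax(logits / temperature) as a list of probabilities.

    Assumption: temperature divides the logits, the usual convention;
    the paper's exact implementation is not shown in this summary.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate actions (illustrative only).
logits = [2.0, 1.0, 0.5]
probs_t1 = softmax_with_temperature(logits, temperature=1.0)
probs_t01 = softmax_with_temperature(logits, temperature=0.1)
```

Under this convention, a temperature well below 1 concentrates nearly all probability mass on the highest-scoring action, so sampling at t = 0.1 behaves close to greedy selection.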