Human-Level Performance in No-Press Diplomacy via Equilibrium Search

Authors: Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via regret minimization. ... We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and ranks in the top 2% of human players when playing anonymous games on a popular Diplomacy website. (A minimal regret-matching illustration follows the table.)
Researcher Affiliation | Industry | Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown; Facebook AI Research; {jsgray,alerer,yolo,noambrown}@fb.com
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links to open-source code for the described methodology, nor does it explicitly state that the code will be released.
Open Datasets | No | The paper states, 'We thank Kestas Kuliukas and the entire webdiplomacy.net team for their cooperation and for providing the dataset used in this research,' but does not provide concrete access information (a link, DOI, or explicit statement of public availability) for the dataset.
Dataset Splits | No | The paper refers to a 'corpus of 46,148 Diplomacy games' and to 'training data', but does not explicitly specify training, validation, or test dataset splits.
Hardware Specification | Yes | The search algorithm used ... typically required between 2 minutes and 20 minutes per turn using a single Volta GPU and 8 CPU cores, depending on the hyperparameters used for the game.
Software Dependencies | No | The paper implicitly references software components such as Python, PyTorch, and CUDA through context (e.g., 'deep reinforcement learning', 'neural networks'), but it does not specify explicit version numbers for any key software dependency.
Experiment Setup | Yes | In non-live games, we typically ran RM for 2,048 iterations with a rollout length of 3 movement phases, and set M (the constant which is multiplied by the number of units to determine the number of subgame actions) equal to 5. ... In live games ... we ran RM for 256 iterations with a rollout length of 2 movement phases, and set M equal to 3.5. ... In all cases, the temperature for the blueprint in rollouts was set to 0.75. (See the configuration sketch after the table.)
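
The Research Type row quotes the abstract's description of one-step lookahead search via regret minimization (RM). The snippet below is a minimal, self-contained illustration of the regret-matching update on a two-player zero-sum matrix game; it is not the authors' implementation, which runs RM over candidate Diplomacy action sets sampled from the blueprint policy, with utilities estimated by truncated blueprint rollouts rather than read from a payoff matrix.

import numpy as np

def regret_matching(payoff, iterations=2048):
    """Regret matching for both players of a zero-sum matrix game.

    payoff[i, j] is the row player's utility when the row player chooses
    action i and the column player chooses action j (the column player
    receives -payoff[i, j]). Returns each player's average strategy,
    which approximates a Nash equilibrium of the matrix game.
    """
    n_row, n_col = payoff.shape
    regrets = [np.zeros(n_row), np.zeros(n_col)]
    strategy_sums = [np.zeros(n_row), np.zeros(n_col)]

    def current_strategy(r):
        # Play in proportion to positive regret; fall back to uniform.
        pos = np.maximum(r, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(r.size, 1.0 / r.size)

    for _ in range(iterations):
        s_row = current_strategy(regrets[0])
        s_col = current_strategy(regrets[1])
        strategy_sums[0] += s_row
        strategy_sums[1] += s_col

        # Expected utility of each pure action against the opponent's current mix.
        u_row = payoff @ s_col        # row player's per-action utilities
        u_col = -(s_row @ payoff)     # column player's per-action utilities

        # Accumulate instantaneous regrets (action utility minus realized value).
        regrets[0] += u_row - s_row @ u_row
        regrets[1] += u_col - s_col @ u_col

    return strategy_sums[0] / iterations, strategy_sums[1] / iterations


if __name__ == "__main__":
    # Matching pennies: the unique equilibrium mixes 50/50 for both players,
    # and the average RM strategies converge toward that mix.
    pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
    print(regret_matching(pennies))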
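
As a compact reference for the Experiment Setup row, the reported search hyperparameters can be grouped into a configuration dictionary. The key names below are hypothetical and do not come from the authors' codebase; only the values are the paper's reported settings.

# Illustrative grouping of the search hyperparameters quoted above;
# key names are hypothetical, values are the paper's reported settings.
SEARCH_HYPERPARAMS = {
    "non_live_games": {
        "rm_iterations": 2048,     # regret-minimization iterations per turn
        "rollout_length": 3,       # movement phases rolled out per subgame action
        "actions_per_unit_M": 5,   # M, multiplied by unit count to cap subgame actions
    },
    "live_games": {
        "rm_iterations": 256,
        "rollout_length": 2,
        "actions_per_unit_M": 3.5,
    },
    "blueprint_rollout_temperature": 0.75,  # sampling temperature used in all cases
}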