Modeling Strong and Human-Like Gameplay with KL-Regularized Search

Authors: Athul Paul Jacob, David J Wu, Gabriele Farina, Adam Lerer, Hengyuan Hu, Anton Bakhtin, Jacob Andreas, Noam Brown

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In chess and Go, we show that regularized search algorithms that penalize KL divergence from an imitation-learned policy yield higher prediction accuracy of strong humans and better performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.
Researcher Affiliation | Collaboration | 1Meta AI Research, New York, NY, USA; 2CSAIL, MIT, Cambridge, MA, USA; 3School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
Pseudocode | Yes | Algorithm 1 piKL-Hedge (for Player i). A hedged sketch of this update appears after the table.
Open Source Code | No | The paper discusses various open-source projects and datasets used (e.g., the Maia models, the GoGoD dataset, KataGo) and refers to a GitHub thread, but does not provide a statement or link for open-sourcing the code for the methods described in this paper.
Open Datasets | Yes | In chess, for the human-learned anchor policy we use the pretrained Maia1100, Maia1500, and Maia1900 models from McIlroy-Young et al. (2020a)... For Go, we trained a deep neural net on the GoGoD professional game dataset (https://gogodonline.co.uk/)... We use a similar dataset acquired from en.boardgamearena.com as in (Hu et al., 2021b).
Dataset Splits | Yes | For Go, we trained a deep neural net on the GoGoD professional game dataset. We match Cazenave (2017) in using games from 1900 through 2014 for training and 2015-2016 as the test set, with roughly 73,000 and 6,500 games, respectively. ... We randomly sample 1,000 games to create a validation set and another 4,000 games for the test set. The training set contains the remaining 235,954 games with an average score of 15.88.
Hardware Specification | No | The paper mentions training on "8 GPUs" but does not provide specific hardware details such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or operating systems used in the experiments.
Experiment Setup | Yes | We train using a mini-batch size of 2048, distributed as 8 batches of 256 across 8 GPUs, and train for a total of 64 epochs (roughly 475,000 minibatches) for the GoGoD dataset. We use SGD with momentum 0.9, a weight decay coefficient of 1e-4, and a learning rate schedule of 1e-1, 1e-2, 1e-3, 1e-4 for the first 16, next 16, next 16, and last 16 epochs respectively. ... We perform 1-ply lookahead where on each turn, we sample up to 30 of the most likely actions for each player from a policy network trained via imitation learning on human data (IL policy)... See Appendix H for more details about the hyperparameters used. A training-setup sketch also appears after the table.
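
The pseudocode row above refers to the paper's piKL-Hedge procedure: a Hedge-style regret minimizer whose per-iteration policy is pulled toward an imitation-learned anchor policy by a KL penalty of strength lambda. The sketch below is a minimal illustration of that idea under an anchored-FTRL parameterization, not the paper's implementation; the function and variable names (anchored_hedge_policy, avg_q, anchor_policy, lam, eta) are assumptions made for this example.

```python
"""Illustrative anchored-Hedge update in the spirit of piKL-Hedge (Algorithm 1).
This is a sketch under assumed names and parameterization, not the paper's code."""
import numpy as np


def anchored_hedge_policy(avg_q, anchor_policy, lam, eta, t):
    """Per-iteration policy that trades off the running average action values
    `avg_q` against KL divergence from the imitation-learned `anchor_policy`.

    As t grows, the policy approaches pi(a) proportional to
    anchor_policy(a) * exp(avg_q(a) / lam), so large lam stays close to the
    human anchor while small lam plays greedily with respect to the values.
    """
    temp = lam + 1.0 / (eta * t)                     # effective temperature at iteration t
    logits = (avg_q + lam * np.log(anchor_policy)) / temp
    logits -= logits.max()                           # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()


# Toy usage: 3 actions, anchor prefers action 0, average values prefer action 2.
tau = np.array([0.7, 0.2, 0.1])
avg_q = np.array([0.0, 0.1, 0.5])
for lam in (10.0, 0.01):                             # large lam: human-like; small lam: value-driven
    print(lam, anchored_hedge_policy(avg_q, tau, lam=lam, eta=0.1, t=100))
```

In a self-play loop, each player would resample from this policy every iteration, observe the resulting action values, fold them into `avg_q`, and average the per-iteration policies to obtain the final KL-regularized search policy.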
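
The experiment-setup row quotes the supervised training recipe for the Go policy network. Below is a minimal PyTorch sketch of that recipe, assuming a placeholder model and omitting data loading and distributed setup; the optimizer settings and step schedule mirror the quoted text, everything else is illustrative.

```python
"""Sketch of the quoted training setup: SGD (momentum 0.9, weight decay 1e-4),
LR 1e-1/1e-2/1e-3/1e-4 over four blocks of 16 epochs, global batch 2048 split
as 8 x 256 across 8 GPUs. The model is a placeholder, not the paper's network."""
import torch

model = torch.nn.Linear(362, 362)  # placeholder; the paper uses a deep policy net
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, weight_decay=1e-4)
# Drop the learning rate by 10x after epochs 16, 32, and 48 (64 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[16, 32, 48], gamma=0.1)

per_gpu_batch = 256  # 8 GPUs x 256 = 2048 global mini-batch; in practice a
                     # data-parallel wrapper (e.g. DistributedDataParallel)
                     # would shard the GoGoD data across the 8 workers.
for epoch in range(64):
    # ... iterate over this worker's training shard; each step computes a
    #     cross-entropy loss on the policy targets, backprops, and calls
    #     optimizer.step() ...
    scheduler.step()
```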