Amortized Planning with Large-Scale Transformers: A Case Study on Chess
Authors: Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Kevin Li, Elliot Catt, John Reid, Cannada Lewis, Joel Veness, Tim Genewein
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper uses chess, a landmark planning problem in AI, to assess transformers' performance on a planning task where memorization is futile even at a large scale. To this end, we release ChessBench, a large-scale benchmark dataset of 10 million chess games with legal move and value annotations (15 billion data points) provided by Stockfish 16, the state-of-the-art chess engine. We train transformers with up to 270 million parameters on ChessBench via supervised learning and perform extensive ablations to assess the impact of dataset size, model size, architecture type, and different prediction targets (state-values, action-values, and behavioral cloning). |
| Researcher Affiliation | Industry | ¹Google DeepMind, ²Google. Correspondence to {anianr, timgen}@google.com. |
| Pseudocode | No | The paper describes the methods in text and figures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open source our Chess Bench dataset, our model weights, and all training and evaluation code at https://github.com/google-deepmind/searchless_chess |
| Open Datasets | Yes | To construct a training dataset for supervised learning we downloaded 10 million games from Lichess (lichess.org) from February 2023. We introduce ChessBench, a large-scale benchmark dataset for chess, consisting of 530M board states (from 10M games on lichess.org)... We open source our ChessBench dataset, our model weights, and all training and evaluation code at https://github.com/google-deepmind/searchless_chess |
| Dataset Splits | No | The paper describes training and test sets but does not explicitly define a separate 'validation' split. It refers to a 'test loss', which may serve a validation purpose, but a distinct validation split is never described. |
| Hardware Specification | Yes | We used 4 Tensor Processing Units (V5) per model for the ablation experiments. We used 128 Tensor Processing Units (V5) per model to train our large (9M, 136M and 270M) models. We used a single Tensor Processing Unit (V3) per agent for our Elo tournament... (i) to compute the Lichess Elo, we use a 6-core Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, and (ii) to compute the tournament Elo, we use a single Tensor Processing Unit (V3), as for all the other agents. |
| Software Dependencies | No | Our codebase is based on JAX [51] and the DeepMind JAX Ecosystem [52, 53]. While these frameworks are mentioned, specific version numbers for JAX, the DeepMind JAX Ecosystem, or Haiku are not provided in the text or the linked references. |
| Experiment Setup | Yes | We train for 10 million steps with a batch size of 4096... We use the Adam optimizer [17] with a learning rate of 1 × 10⁻⁴... We use three different model configurations (with a widening factor of 4): (i) 8 heads, 8 layers, and an embedding dimension of 256, (ii) 8 heads, 8 layers, and an embedding dimension of 1024, and (iii) 8 heads, 16 layers, and an embedding dimension of 1024... If not mentioned otherwise, K = 128. For our behavioral cloning experiments we train to directly predict the oracle actions, which are already discrete. We train our predictors by minimizing the cross-entropy loss via mini-batch stochastic gradient descent using Adam [17]. For state- and action-value prediction, we additionally apply label smoothing via the HL-Gauss loss [18], using a Gaussian smoothing distribution with the mean given by the label and a standard deviation of σ = 0.75/K, as recommended by Farebrother et al. [19]. |
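
To make the label-smoothing step in the experiment setup concrete, below is a minimal JAX sketch of the HL-Gauss target construction: scalar values (e.g., win probabilities in [0, 1]) are mapped to a soft distribution over K = 128 uniform bins by integrating a Gaussian with mean equal to the label and σ = 0.75/K, and the model is then trained with cross-entropy against that distribution. This is a sketch under stated assumptions, not the authors' released implementation; the function names and the tail renormalization are illustrative.

```python
# Minimal sketch of HL-Gauss label smoothing, assuming values in [0, 1],
# K = 128 uniform bins, and sigma = 0.75 / K. Illustrative only; not the
# paper's released code.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

K = 128
SIGMA = 0.75 / K
bin_edges = jnp.linspace(0.0, 1.0, K + 1)  # K uniform bins over [0, 1]


def hl_gauss_targets(value: jnp.ndarray) -> jnp.ndarray:
    """Soft targets: Gaussian (mean=value, std=SIGMA) mass in each bin."""
    cdf = norm.cdf(bin_edges, loc=value[..., None], scale=SIGMA)
    probs = cdf[..., 1:] - cdf[..., :-1]  # probability mass per bin
    # Renormalize mass truncated outside [0, 1] near extreme labels.
    return probs / probs.sum(axis=-1, keepdims=True)


def hl_gauss_loss(logits: jnp.ndarray, value: jnp.ndarray) -> jnp.ndarray:
    """Cross-entropy between K-bin logits and the smoothed targets."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -(hl_gauss_targets(value) * log_probs).sum(axis=-1).mean()


# Example: a batch of two action-values scored against random logits.
values = jnp.array([0.25, 0.9])
logits = jax.random.normal(jax.random.PRNGKey(0), (2, K))
print(hl_gauss_loss(logits, values))
```

The renormalization step compensates for Gaussian mass falling outside [0, 1] when the label sits near a boundary; see Farebrother et al. [19] for the full treatment of the loss.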