Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mastering Board Games by External and Internal Planning with Language Models

Authors: John Schultz, Jakub Adamek, Matej Jusup, Marc Lanctot, Michael Kaisers, Sarah Perrin, Daniel Hennes, Jeremy Shar, Cannada A. Lewis, Anian Ruoss, Tom Zahavy, Petar Veličković, Laurel Prince, Satinder Singh, Eric Malmi, Nenad Tomasev

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
Researcher Affiliation Collaboration 1Google DeepMind 2ETH Zürich 3Google. Correspondence to: Eric Malmi <EMAIL>, Nenad Tomasev <EMAIL>.
Pseudocode Yes The external search algorithm guided by a learned world model is summarized in Algorithm 1 (EXTERNAL-MCTS(s0)), with subroutines contained in Appendix A.
Open Source Code No The paper does not include an explicit statement that the authors are releasing their code, nor a direct link to a repository for the methodology described in this paper.
Open Datasets Yes We curate a dataset of diverse, relevant positions in four games: Chess, Chess960, Connect Four, and Hex. The statistics and sources for each of these datasets are shown in Table 4 in Appendix B. Each position is used to produce a single training example, randomly varying (i) the k action values in %top k, (ii) the presence of the initial or final state tracking commands, (iii) the use and order of %state or %FEN representations in chess. Further details on the datasets are provided in Appendix B.
Dataset Splits No The paper mentions training and evaluation data but does not provide specific details on how their curated datasets were split into training, validation, and test sets (e.g., exact percentages or sample counts). While it states "We leveraged the pre-trained MAV and fine-tuned it using a mixture of 60% MAV data and 40% search data," this describes the composition for a fine-tuning step rather than the primary dataset splits for the MAV model itself.
Hardware Specification No The paper mentions build settings for running Stockfish (e.g., "-DUSE_AVX2 -DUSE_PEXT -march=haswell"), but does not provide specific hardware details (such as GPU models, CPU models, or memory) used for training or running their own LLMs (MAV and MAV small).
Software Dependencies No The paper mentions several software components, including the "Gemini architecture," "Stockfish 16," "Open Spiel," "Fhourstones," "neurobenzene," and "Python's concurrent.futures module." However, it only provides a specific version number for Stockfish (16) and does not specify versions for the other key software or libraries.
Experiment Setup Yes Fine-tuning was run for 20,000 steps, using a batch size of 512. For our external-MCTS agents, we tune hyper-parameters using a combination of manual and head-to-head comparisons. We found that setting τ = 0.1, k = 5, and ε = 0.05 worked best for the prior (Equation 1). For Async MCTS we used a batch size b = 16. We used a timeout of t0 = 60 seconds, and (only as a baseline) static virtual loss parameter nc = 10. We used dynamic virtual count values nmin = 2 in all cases, and nmax varying with the number of simulations: (M, nmax) ∈ {(100, 8), (250, 8), (500, 8), (1000, 16), (2000, 32)}.
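The quoted hyper-parameters can be made concrete with a small sketch. The paper's Equation 1 is not reproduced on this page, so the exact functional form below, a temperature-scaled softmax over the top-k action values with ε-uniform smoothing, is an assumption chosen only because it is consistent with the quoted τ, k, and ε; `search_prior` and `VIRTUAL_COUNT_MAX` are illustrative names, not identifiers from the paper.

```python
import math

def search_prior(action_values, tau=0.1, k=5, eps=0.05):
    """Hypothetical top-k softmax prior with temperature tau and
    epsilon-uniform smoothing (an assumed reading of Equation 1).

    action_values: dict mapping action -> scalar value estimate.
    Returns a dict mapping the k highest-valued actions to prior
    probabilities that sum to 1.
    """
    # Keep only the k highest-valued actions.
    top = sorted(action_values.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Numerically stable softmax at temperature tau.
    m = max(v for _, v in top)
    exps = {a: math.exp((v - m) / tau) for a, v in top}
    z = sum(exps.values())
    n = len(top)
    # Mix the softmax with a uniform distribution, weighted by eps.
    return {a: (1.0 - eps) * e / z + eps / n for a, e in exps.items()}

# Dynamic virtual count schedule quoted in the row above:
# nmax as a function of the simulation budget M (nmin = 2 throughout).
VIRTUAL_COUNT_MAX = {100: 8, 250: 8, 500: 8, 1000: 16, 2000: 32}
```

With τ = 0.1 the softmax is sharp, so the prior concentrates on the best-valued actions while the ε term keeps every top-k action reachable during search.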