Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generating Creative Chess Puzzles

Authors: Xidong Feng, Vivek Veeriah, Marcus Chiam, Michael Dennis, Federico Barbero, Johan Obando Ceron, Jiaxin Shi, Satinder P. Singh, Shaobo Hou, Nenad Tomasev, Tom Zahavy

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper presents an approach to tackle these difficulties in the domain of chess puzzles. We start by benchmarking Generative AI architectures, and then introduce an RL framework with novel rewards based on chess engine search statistics to overcome some of those shortcomings. Our RL approach dramatically increases counter-intuitive puzzle generation by 10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and the best Lichess-trained model (0.4%). Our puzzles meet novelty and diversity benchmarks, retain aesthetic themes, and are rated by human experts as more creative, enjoyable, and counterintuitive than composed book puzzles, even approaching classic compositions. Section 3: Experiments
Researcher Affiliation	Industry	Xidong Feng, Vivek Veeriah, Marcus Chiam, Michael Dennis, Federico Barbero, Johan Obando-Ceron, Jiaxin Shi, Satinder Singh, Shaobo Hou, Nenad Tomašev, Tom Zahavy; Google DeepMind; University of Oxford; Mila, University of Montreal; EMAIL. All explicitly named authors are primarily affiliated with Google DeepMind, an industry entity, as indicated by the email domains and the general context of the paper.
Pseudocode	No	The paper describes methods and equations for quantification and reward functions (e.g., Eq. 1-7) and architectural details for models, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Our codebase is built heavily on internal infrastructure. We can open source them after we adapt them. But we detail our experimental settings and configurations, which should be enough to reproduce easily.
Open Datasets	Yes	We experimented with different candidates including auto-regressive transformer [91], latent diffusion models [70], masked discrete diffusion models [77], and MaskGIT [11] all trained on the Lichess Puzzler dataset. We compare our models with two baselines: the lichess dataset and a standard game dataset (2024-11) [51]. lichess.org. Lichess puzzle dataset, 2025. URL https://database.lichess.org/#puzzles.
Dataset Splits	Yes	The models were trained exclusively on the Lichess Puzzler [52] dataset (without any pre-training), excluding any chess compositions or other datasets. We split train and test set by 99% and 1%, resulting in 4.36M train samples and 44k test samples correspondingly.
Hardware Specification	No	We spend most of the compute over CPUs. For the final RL experiment, we run the Stockfish on 28M chessboards in total with 4096 CPUs, corresponding to 175k CPU hours (each position takes 15-30s, we choose average 22.5s for calculation).
Software Dependencies	No	The paper states "We use the Stockfish engine... Version 17.1 (as of October 23, 2025)" (from reference [86]) and "We adopt the AdamW optimizer." However, specific versions for other key software components, such as Python or deep learning frameworks, are not provided.
Experiment Setup	Yes	Autoregressive Transformer: We use 8 heads, 16 layers, and an embedding dimension of 1024, resulting in 200M parameters in total. We adopt the AdamW optimizer, with 1e-4 learning rate and 1e-4 weight decay coefficient. We conduct the training with 1024 batch size and 100k steps. For training the diffusion model, we used the Adam optimizer with a learning rate of 2e-4. We used a batch size of 512. The training followed the Denoising Diffusion Probabilistic Models (DDPM) [32] framework. The diffusion model was trained for 300,000 steps.