Large-Scale Retrieval for Reinforcement Learning
Authors: Peter Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, Timothy Lillicrap
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach in an offline RL setting for an environment with a combinatorial state space: the game of 9x9 Go. |
| Researcher Affiliation | Industry | DeepMind, London {peterhumphreys, aguez, ...}@google.com |
| Pseudocode | Yes | Algorithm 1 Training semi-parametric action-conditional model |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Instructions and links to relevant libraries were provided. |
| Open Datasets | No | We collected a dataset of 3.5M expert 9x9 Go self-play games from an Alpha Zero-style agent [40]. While the source of the data-generation agent is cited, the paper does not explicitly state that this collected dataset of 50M board-state observations is publicly available, nor does it provide a direct access link to it. |
| Dataset Splits | No | The paper states: 'During training, we split D_r in two halves such that each game's observations are only in one of the datasets. We retrieve neighbors for an observation o_t from the half it is not contained in. This is simply to avoid retrieving the same position as the query.' This describes a data-splitting strategy (sketched below the table) but does not explicitly define a validation split for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We conducted all experiments on the DeepMind JAX Ecosystem [3], using TPUs [17] |
| Software Dependencies | No | The paper mentions using the 'DeepMind JAX Ecosystem' but does not specify software versions (e.g., JAX version, Python version, specific library versions). |
| Experiment Setup | Yes | We train a model-based agent [36] that predicts future policies and values conditioned on future actions in a given state. This semi-parametric model incorporates a retrieval mechanism, which allows it to utilise information from a large-scale dataset to inform its predictions. We train the agent in a supervised offline RL setting... We use MCTS with a pUCT rule for internal action selection [35, 40], carrying out n_sims simulations per time step (see the pUCT sketch below). |
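
To make the split-and-retrieve scheme quoted in the Dataset Splits row concrete, here is a minimal sketch. The function and variable names (`split_by_game`, `retrieve_neighbors`, the dot-product similarity, `k=5`) are illustrative assumptions, not taken from the paper's code; the paper retrieves from a 50M-observation dataset, which requires an approximate nearest-neighbour index rather than this brute-force search.

```python
import numpy as np


def split_by_game(observations, game_ids, rng=None):
    """Split observations into two halves such that all observations
    from a given game land in exactly one half (so a query can never
    retrieve positions from its own game)."""
    rng = rng or np.random.default_rng(0)
    unique_games = np.unique(game_ids)
    rng.shuffle(unique_games)
    half_a_games = set(unique_games[: len(unique_games) // 2])
    in_a = np.array([g in half_a_games for g in game_ids])
    return observations[in_a], observations[~in_a]


def retrieve_neighbors(query, candidates, k=5):
    """Brute-force nearest-neighbour lookup by dot-product similarity.
    Illustrative only: large-scale retrieval needs an ANN index."""
    scores = candidates @ query
    return candidates[np.argsort(-scores)[:k]]


# Queries from one half retrieve neighbours only from the other half.
obs = np.random.randn(1000, 64)          # toy encoded board states
games = np.repeat(np.arange(100), 10)    # 100 games x 10 positions each
half_a, half_b = split_by_game(obs, games)
neighbors = retrieve_neighbors(half_a[0], half_b, k=5)
print(neighbors.shape)                   # (5, 64)
```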
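
The Experiment Setup row mentions MCTS with a pUCT rule for internal action selection. Below is a minimal sketch of the standard AlphaZero-style pUCT selection rule; the exploration constant `c_puct` and its value are assumptions, since the quoted text only states that n_sims simulations are carried out per time step.

```python
import numpy as np


def puct_select(q_values, visit_counts, priors, c_puct=1.25):
    """AlphaZero-style pUCT action selection:
    argmax_a [ Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) ].
    c_puct = 1.25 is an assumed constant, not taken from the paper."""
    total_visits = visit_counts.sum()
    exploration = c_puct * priors * np.sqrt(total_visits) / (1.0 + visit_counts)
    return int(np.argmax(q_values + exploration))


# Example: choosing among three candidate moves at a search node.
q = np.array([0.10, 0.30, 0.20])   # value estimates Q(s, a)
n = np.array([10, 2, 5])           # visit counts N(s, a)
p = np.array([0.50, 0.30, 0.20])   # prior policy P(s, a)
print(puct_select(q, n, p))        # favours the under-visited, high-prior move
```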