Large-Scale Retrieval for Reinforcement Learning

Authors: Peter Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, Timothy Lillicrap

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We evaluate this approach in an offline RL setting for an environment with a combinatorial state space: the game of 9x9 Go.'
Researcher Affiliation | Industry | 'DeepMind, London. {peterhumphreys, aguez, ...}@google.com'
Pseudocode | Yes | 'Algorithm 1: Training semi-parametric action-conditional model'
Open Source Code | No | 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Instructions and links to relevant libraries were provided.'
Open Datasets | No | 'We collected a dataset of 3.5M expert 9x9 Go self-play games from an AlphaZero-style agent [40].' While the source of the data-generating agent is cited, the paper does not explicitly state that the resulting dataset of 50M board-state observations is publicly available, nor does it provide a direct access link to it.
Dataset Splits | No | The paper states: 'During training, we split D_r in two halves such that each game's observations are only in one of the datasets. We retrieve neighbors for an observation o_t from the half it is not contained in. This is simply to avoid retrieving the same position as the query.' This describes a data-splitting strategy for retrieval (see the sketch after the table) but does not explicitly define a validation split for hyperparameter tuning or model selection.
Hardware Specification | Yes | 'We conducted all experiments on the DeepMind JAX Ecosystem [3], using TPUs [17].'
Software Dependencies | No | The paper mentions using the 'DeepMind JAX Ecosystem' but does not specify software versions (e.g., JAX version, Python version, specific library versions).
Experiment Setup | Yes | 'We train a model-based agent [36] that predicts future policies and values conditioned on future actions in a given state. This semiparametric model incorporates a retrieval mechanism, which allows it to utilise information from a large-scale dataset to inform its predictions. We train the agent in a supervised offline RL setting... We use MCTS with a pUCT rule for internal action selection [35, 40], carrying out n_sims simulations per time step.' (A minimal pUCT selection sketch follows the table.)
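
The data-splitting strategy quoted under Dataset Splits can be illustrated with a short sketch. The code below partitions games into two halves, keeping every observation of a game in the same half, and then searches only the opposite half when retrieving neighbors for a query observation. All names (split_games_into_halves, retrieve_neighbors) and the brute-force distance search are hypothetical stand-ins; the paper's actual pipeline uses learned embeddings and a large-scale approximate nearest-neighbor index rather than this exhaustive search.

import numpy as np

def split_games_into_halves(game_ids, seed=0):
    """Assign each game to one of two halves so that all observations
    from a single game land in the same half (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    games = np.unique(game_ids)
    half_a = set(rng.permutation(games)[: len(games) // 2].tolist())
    # Boolean flag per observation: True if its game is in half A.
    return np.array([g in half_a for g in game_ids])

def retrieve_neighbors(query_emb, query_in_half_a, embeddings, in_half_a, k=8):
    """Return indices of the k nearest observations drawn only from the
    half that does NOT contain the query's game. A brute-force distance
    search stands in for the approximate index needed at the paper's scale."""
    candidates = np.where(in_half_a != query_in_half_a)[0]
    dists = np.linalg.norm(embeddings[candidates] - query_emb, axis=1)
    return candidates[np.argsort(dists)[:k]]

This also makes the assessment's point concrete: the split exists to prevent a query from retrieving its own position, not to hold out data for validation.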
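The experiment setup quotes MCTS with a pUCT rule for internal action selection. The sketch below uses the common AlphaZero-style formulation Q(s, a) + c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)); the constant c_puct, the default value for unvisited actions, and the tie-breaking behaviour are illustrative assumptions, not the paper's exact MCTS configuration.

import numpy as np

def puct_select(prior, value_sum, visit_count, c_puct=1.25):
    """Choose the child action maximising Q(s, a) + U(s, a).
    prior:       P(s, a), the policy prior over actions at this node.
    value_sum:   accumulated backed-up values per action.
    visit_count: N(s, a), visit counts per action.
    c_puct:      exploration constant (assumed value, not from the paper)."""
    # Mean value Q(s, a); unvisited actions default to 0.
    q = np.where(visit_count > 0, value_sum / np.maximum(visit_count, 1), 0.0)
    # Exploration bonus U(s, a) favouring high-prior, rarely visited actions.
    u = c_puct * prior * np.sqrt(visit_count.sum()) / (1.0 + visit_count)
    return int(np.argmax(q + u))

In use, each of the n_sims simulations per time step would call puct_select on the current node's statistics and descend to the chosen child before expanding and backing up a value, as in standard pUCT-based MCTS.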