Large-Scale Retrieval for Reinforcement Learning
Authors: Peter Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, Timothy Lillicrap
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach in an offline RL setting for an environment with a combinatorial state space: the game of 9x9 Go. |
| Researcher Affiliation | Industry | DeepMind, London {peterhumphreys, aguez, ...}@google.com |
| Pseudocode | Yes | Algorithm 1 Training semi-parametric action-conditional model |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Instructions and links to relevant libraries were provided. |
| Open Datasets | No | We collected a dataset of 3.5M expert 9x9 Go self-play games from an Alpha Zero-style agent [40]. While the source of the data-generation agent is cited, the paper does not explicitly state that this collected dataset of 50M board-state observations is publicly available, nor does it provide a direct access link to it. |
| Dataset Splits | No | The paper states: 'During training, we split D_r in two halves such that each game's observations are only in one of the datasets. We retrieve neighbors for an observation o_t from the half it is not contained in. This is simply to avoid retrieving the same position as the query.' This describes a data-splitting strategy (sketched below the table) but does not explicitly define a validation split for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We conducted all experiments on the DeepMind JAX Ecosystem [3], using TPUs [17] |
| Software Dependencies | No | The paper mentions using the 'DeepMind JAX Ecosystem' but does not specify software versions (e.g., JAX version, Python version, specific library versions). |
| Experiment Setup | Yes | We train a model-based agent [36] that predicts future policies and values conditioned on future actions in a given state. This semi-parametric model incorporates a retrieval mechanism, which allows it to utilise information from a large-scale dataset to inform its predictions. We train the agent in a supervised offline RL setting... We use MCTS with a pUCT rule for internal action selection [35, 40], carrying out n_sims simulations per time step (see the pUCT sketch below). |
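
To make the split-and-retrieve scheme quoted in the Dataset Splits row concrete, here is a minimal sketch. The function and variable names (`split_by_game`, `retrieve_neighbors`, the dot-product similarity, `k=5`) are illustrative assumptions, not taken from the paper's code; the paper retrieves from a 50M-observation dataset, which requires an approximate nearest-neighbour index rather than this brute-force search.

```python
import numpy as np


def split_by_game(observations, game_ids, rng=None):
    """Split observations into two halves such that all observations
    from a given game land in exactly one half (so a query can never
    retrieve positions from its own game)."""
    rng = rng or np.random.default_rng(0)
    unique_games = np.unique(game_ids)
    rng.shuffle(unique_games)
    half_a_games = set(unique_games[: len(unique_games) // 2])
    in_a = np.array([g in half_a_games for g in game_ids])
    return observations[in_a], observations[~in_a]


def retrieve_neighbors(query, candidates, k=5):
    """Brute-force nearest-neighbour lookup by dot-product similarity.
    Illustrative only: large-scale retrieval needs an ANN index."""
    scores = candidates @ query
    return candidates[np.argsort(-scores)[:k]]


# Queries from one half retrieve neighbours only from the other half.
obs = np.random.randn(1000, 64)          # toy encoded board states
games = np.repeat(np.arange(100), 10)    # 100 games x 10 positions each
half_a, half_b = split_by_game(obs, games)
neighbors = retrieve_neighbors(half_a[0], half_b, k=5)
print(neighbors.shape)                   # (5, 64)
```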
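
The Experiment Setup row mentions MCTS with a pUCT rule for internal action selection. Below is a minimal sketch of the standard AlphaZero-style pUCT selection rule; the exploration constant `c_puct` and its value are assumptions, since the quoted text only states that n_sims simulations are carried out per time step.

```python
import numpy as np


def puct_select(q_values, visit_counts, priors, c_puct=1.25):
    """AlphaZero-style pUCT action selection:
    argmax_a [ Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) ].
    c_puct = 1.25 is an assumed constant, not taken from the paper."""
    total_visits = visit_counts.sum()
    exploration = c_puct * priors * np.sqrt(total_visits) / (1.0 + visit_counts)
    return int(np.argmax(q_values + exploration))


# Example: choosing among three candidate moves at a search node.
q = np.array([0.10, 0.30, 0.20])   # value estimates Q(s, a)
n = np.array([10, 2, 5])           # visit counts N(s, a)
p = np.array([0.50, 0.30, 0.20])   # prior policy P(s, a)
print(puct_select(q, n, p))        # favours the under-visited, high-prior move
```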