Human-Timescale Adaptation in an Open-Ended Task Space

Authors: Jakob Bauer, Kate Baumli, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, Vibhavari Dasagi, Lucy Gonzalez, Karol Gregor, Edward Hughes, Sheleem Kashem, Maria Loks-Thompson, Hannah Openshaw, Jack Parker-Holder, Shreya Pathak, Nicolas Perez-Nieves, Nemanja Rakicevic, Tim Rocktäschel, Yannick Schroecker, Satinder Singh, Jakub Sygnowski, Karl Tuyls, Sarah York, Alexander Zacherl, Lei M Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variables, Results, and Supporting LLM Responses
Research Type: Experimental
LLM Response: In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. We evaluate our agents in two distinct regimes: on a set of 1000 test tasks sampled from the same distribution as the training tasks, and on a set of 30 single-agent and 28 multi-agent hand-authored probe tasks.

Researcher Affiliation: Industry
LLM Response: Jakob Bauer 1, Kate Baumli 1, Feryal Behbahani 1, Avishkar Bhoopchand 1, Nathalie Bradley-Schmieg 1, Michael Chang 1, Natalie Clay 1, Adrian Collister 1, Vibhavari Dasagi 1, Lucy Gonzalez 1, Karol Gregor 1, Edward Hughes 1, Sheleem Kashem 1, Maria Loks-Thompson 1, Hannah Openshaw 1, Jack Parker-Holder 1, Shreya Pathak 1, Nicolas Perez-Nieves 1, Nemanja Rakicevic 1, Tim Rocktäschel 1, Yannick Schroecker 1, Satinder Singh 1, Jakub Sygnowski 1, Karl Tuyls 1, Sarah York 1, Alexander Zacherl 1, Lei Zhang 1. Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. 1 DeepMind. Correspondence to: Feryal Behbahani <feryal@google.com>, Edward Hughes <edwardhughes@google.com>.

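The second of the three ingredients quoted above is a policy parameterised as a large-scale attention-based memory. The paper does not release code, so the following is only a minimal JAX sketch of that general idea: a single causal self-attention read over a small buffer of past timestep embeddings. The shapes, weight matrices, and single-head structure are illustrative assumptions, not AdA's architecture.

```python
import jax
import jax.numpy as jnp

def causal_attention(memory, w_q, w_k, w_v):
    """One causal self-attention read over a [T, D] buffer of past timestep embeddings."""
    q, k, v = memory @ w_q, memory @ w_k, memory @ w_v            # each [T, D]
    scores = q @ k.T / jnp.sqrt(jnp.asarray(k.shape[-1], jnp.float32))   # [T, T]
    mask = jnp.tril(jnp.ones_like(scores))                        # causal: attend to the past only
    scores = jnp.where(mask == 1.0, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v                                            # [T, D] context for the policy head

key = jax.random.PRNGKey(0)
T, D = 8, 16                                                      # toy sizes, not the paper's
memory = jax.random.normal(key, (T, D))
w_q, w_k, w_v = (jax.random.normal(jax.random.fold_in(key, i), (D, D)) for i in (1, 2, 3))
print(causal_attention(memory, w_q, w_k, w_v).shape)              # (8, 16)
```

In an actual agent the resulting context vectors would condition the policy and value heads, and the memory would be far larger and multi-layered than this toy buffer.
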
Pseudocode: No
LLM Response: Not found. The paper describes the algorithms used (Muesli, No-op filtering, PLR) but does not present them in pseudocode blocks.

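Because the response above notes that PLR is described but never given as pseudocode, here is a heavily simplified, illustrative sketch of rank-prioritised task replay in the spirit of PLR. The class, method names, fitness signal, and constants are assumptions made for illustration, not the paper's curriculum implementation; Muesli and no-op filtering are not sketched.

```python
import random

class TaskReplayBuffer:
    """Toy rank-prioritised task buffer in the spirit of Prioritised Level Replay."""

    def __init__(self, capacity=1000, temperature=0.3, p_replay=0.5):
        self.capacity, self.temperature, self.p_replay = capacity, temperature, p_replay
        self.tasks, self.fitness = [], []

    def sample(self, sample_new_task):
        """With probability (1 - p_replay) try a fresh task, else replay a high-fitness one."""
        if not self.tasks or random.random() > self.p_replay:
            return sample_new_task()
        order = sorted(range(len(self.tasks)), key=lambda i: -self.fitness[i])
        rank = {task_idx: pos for pos, task_idx in enumerate(order)}
        weights = [(1.0 / (rank[i] + 1)) ** (1.0 / self.temperature)
                   for i in range(len(self.tasks))]
        return random.choices(self.tasks, weights=weights, k=1)[0]

    def update(self, task, score):
        """Insert or refresh a task with its latest fitness score (e.g. a regret estimate)."""
        if task in self.tasks:
            self.fitness[self.tasks.index(task)] = score
            return
        self.tasks.append(task)
        self.fitness.append(score)
        if len(self.tasks) > self.capacity:           # evict the lowest-fitness task
            drop = self.fitness.index(min(self.fitness))
            self.tasks.pop(drop)
            self.fitness.pop(drop)
```

A learner would call update() with a fitness estimate after each episode on a task and sample() to choose the next training task.
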
Open Source Code: No
LLM Response: Not found. The paper mentions videos on their microsite but does not state that the source code for their method is open-source or provide a link.

Open Datasets: Yes
LLM Response: To make this possible in an RL setting, where agents collect their own data, we extend the recent XLand environment (OEL Team et al., 2021), producing a vast open-ended world with over 10^40 possible tasks.

Dataset Splits: No
LLM Response: We evaluate our agents in two distinct regimes: on a set of 1000 test tasks sampled from the same distribution as the training tasks, and on a set of 30 single-agent and 28 multi-agent hand-authored probe tasks. The total achievable reward on each task varies, so whenever we present aggregated results on the test or hand-authored task set, we normalise the total per-trial reward for each task against the reward obtained by fine-tuning AdA on the respective task set (see Appendix G for details).

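As a concrete reading of the normalisation sentence quoted above (not code from the paper): each task's total per-trial reward is divided by the reward a fine-tuned AdA obtains on that task, so a score of 1.0 means parity with the fine-tuned reference. The function and task names below are hypothetical.

```python
def normalised_scores(agent_reward, finetuned_ada_reward):
    """Per-task total reward divided by the fine-tuned AdA reference on the same task.

    Both arguments map task name -> total per-trial reward; names are illustrative.
    """
    return {task: agent_reward[task] / finetuned_ada_reward[task] for task in agent_reward}

print(normalised_scores({"probe_01": 3.0, "probe_02": 0.5},
                        {"probe_01": 6.0, "probe_02": 1.0}))   # {'probe_01': 0.5, 'probe_02': 0.5}
```
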
Hardware Specification: Yes
LLM Response: AdA was implemented using JAX (Bradbury et al., 2018) and trained on 64 TPU devices.

Software Dependencies: No
LLM Response: AdA was implemented using JAX (Bradbury et al., 2018) and trained on 64 TPU devices.

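The hardware and software rows above quote the same sentence: JAX, trained on 64 TPU devices. The snippet below is generic JAX usage rather than the paper's training code, showing how a JAX program discovers its accelerator devices and maps a step function across them; the step function is a placeholder.

```python
import jax
import jax.numpy as jnp

print(jax.device_count())            # 64 in the setup the paper describes; 1 on a plain CPU host
print(jax.devices())                 # e.g. a list of TPU devices on a TPU host

@jax.pmap                            # replicate the step across all local devices
def train_step(batch):
    return jnp.mean(batch, axis=0)   # placeholder standing in for a real learner update

n = jax.local_device_count()
sharded = jnp.ones((n, 32, 8))       # leading axis: one shard per local device
print(train_step(sharded).shape)     # (n, 8)
```
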
Experiment Setup: Yes
LLM Response: Table F.1 details the full learning algorithm hyperparameters. Table G.1 details the experimental setup for experiments in Section 3.1.