Unlocking the Power of Representations in Long-term Novelty-based Exploration

Authors: Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of episodes. We further propose a novel generalization of the inverse dynamics loss, which leverages masked transformer architectures for multi-step prediction and which, in conjunction with RECODE, achieves a new state-of-the-art in a suite of challenging 3D-exploration tasks in DM-HARD-8. RECODE also attains state-of-the-art performance in hard exploration Atari games, and is the first agent to reach the end screen in Pitfall!
Researcher Affiliation | Collaboration | Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot. Google DeepMind. {alaas,skapturowski,dcalandriello,cblundell,psprechmann,leopoldo.sarra,ogroth,valkom,piot}@google.com. Footnotes: equal contributions; Department of Physics, Friedrich-Alexander-Universität Erlangen-Nürnberg; work done while interning at DeepMind.
Pseudocode | Yes | Algorithm 1 RECODE... Algorithm 2 A streaming clustering algorithm.
Open Source Code | No | The paper describes the methods and experiments in detail but does not provide an explicit statement about releasing its source code or a link to a code repository for its contributions (RECODE or CASM).
Open Datasets | Yes | In this section, we experimentally validate the efficacy of our approach on two established benchmarks for exploration in 2D and 3D respectively: a subset of the Atari Learning Environment (ALE, Bellemare et al., 2013) containing eight games such as Pitfall and Montezuma's Revenge which are considered hard exploration problems (Bellemare et al., 2016); and DM-HARD-8 (Gulcehre et al., 2019), a suite of partially observable 3D games.
Dataset Splits | No | The paper describes experimental setups (e.g., 'average performance over 6 seeds' for Atari, 'averaged across three seeds' for DM-HARD-8) and uses established benchmarks like ALE and DM-HARD-8, but it does not specify explicit training/validation/test *dataset splits* in the way this concept is typically applied to static datasets.
Hardware Specification | Yes | One seed for an Atari experiment (e.g., for MEME-RECODE-AP and MEME-NGU-AP) took 24h to execute using multiple servers with a total of 64 CPUs, 1TB RAM, and 5 TPUv4. One seed for a DM-HARD-8 experiment (e.g., for MEME-RECODE-CASM and MEME-NGU-CASM) took 90h to execute using multiple servers with a total of 512 CPUs, 1TB RAM, and 5 TPUv4.
Software Dependencies | No | The paper mentions implementing methods in a 'distributed setting' and using 'MEME' and 'VMPO-based agents', but it does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, or other library versions).
Experiment Setup | Yes | We also report here the precise hyper-parameter values used in our experiments: Table 1 for Atari and Table 2 for DM-HARD-8. We omit hyper-parameters which do not differ from the base MEME agent (Kapturowski et al., 2022).
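
For intuition about the method summarized in the abstract row above, here is a minimal sketch of the count-based novelty bonus a RECODE-style agent could derive from cluster visitation counts in an embedding space. The function name, the kernel-weighted count aggregation, and the parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def intrinsic_reward(embedding, centers, counts, kernel_width=0.1):
    """Novelty bonus from approximate visitation counts.

    `centers` (K x D) and `counts` (K,) are cluster centroids and their
    visitation counts, maintained online (see the clustering sketch below).
    The kernel-weighted aggregation rule is an assumption for illustration.
    """
    # Squared distance from the current state embedding to every centroid.
    d2 = np.sum((centers - embedding) ** 2, axis=1)
    # Soft count: nearby clusters contribute their counts, weighted by a
    # squared-exponential kernel in the embedding space.
    n = float(np.dot(np.exp(-d2 / (2.0 * kernel_width**2)), counts))
    # Classical count-based bonus: rarely visited regions earn more reward.
    return 1.0 / np.sqrt(n + 1.0)
```

And a sketch in the spirit of the streaming clustering update named in the Pseudocode row (Algorithm 2): each new embedding either updates its nearest centroid or recycles a low-count cluster, with a decay on counts so the density estimate can track the nonstationary embeddings of Deep RL. The threshold rule, decay factor, and replacement policy here are assumptions, not the paper's algorithm.

```python
def update_clusters(embedding, centers, counts, threshold=1.0, decay=0.999):
    """One streaming clustering step over a fixed-size bank of clusters."""
    counts *= decay  # slowly forget old visits (nonstationary embeddings)
    d2 = np.sum((centers - embedding) ** 2, axis=1)
    j = int(np.argmin(d2))
    if d2[j] <= threshold**2:
        # Absorb the point into its nearest cluster: bump the count and
        # move the centroid toward the new embedding.
        counts[j] += 1.0
        centers[j] += (embedding - centers[j]) / counts[j]
    else:
        # Far from every centroid: replace the least-visited cluster with
        # a fresh cluster at the current embedding.
        k = int(np.argmin(counts))
        centers[k] = embedding
        counts[k] = 1.0
    return centers, counts
```

In a rollout loop, one would call update_clusters on each new embedding and mix the bonus into the return, e.g. r = r_ext + beta * intrinsic_reward(e, centers, counts), with beta a tunable exploration weight.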