Unlocking the Power of Representations in Long-term Novelty-based Exploration
Authors: Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of episodes. We further propose a novel generalization of the inverse dynamics loss, which leverages masked transformer architectures for multi-step prediction; which in conjunction with RECODE achieves a new state-of-the-art in a suite of challenging 3D-exploration tasks in DM-HARD-8. RECODE also attains state-of-the-art performance in hard exploration Atari games, and is the first agent to reach the end screen in Pitfall! |
| Researcher Affiliation | Collaboration | Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot. Google DeepMind. {alaas,skapturowski,dcalandriello,cblundell,psprechmann,leopoldo.sarra,ogroth,valkom,piot}@google.com. Author footnotes: equal contributions; Department of Physics, Friedrich-Alexander Universität Erlangen-Nürnberg, work done while interning at DeepMind. |
| Pseudocode | Yes | Algorithm 1 RECODE... Algorithm 2 A streaming clustering algorithm. |
| Open Source Code | No | The paper describes the methods and experiments in detail but does not provide an explicit statement about releasing its source code or a link to a code repository for its contributions (RECODE or CASM). |
| Open Datasets | Yes | In this section, we experimentally validate the efficacy of our approach on two established benchmarks for exploration in 2D and 3D respectively: a subset of the Atari Learning Environment (ALE, Bellemare et al., 2013) containing eight games such as Pitfall and Montezuma's Revenge which are considered hard exploration problems (Bellemare et al., 2016); and DM-HARD-8 (Gulcehre et al., 2019), a suite of partially observable 3D games. |
| Dataset Splits | No | The paper describes experimental setups (e.g., 'average performance over 6 seeds' for Atari, 'averaged across three seeds' for DM-HARD-8) and uses established benchmarks like ALE and DM-HARD-8, but it does not specify explicit training/validation/test *dataset splits* in the way this concept is typically applied to static datasets. |
| Hardware Specification | Yes | One seed for an Atari experiment (e.g., for MEME-RECODE-AP and MEME-NGU-AP) took 24h to execute using multiple servers with a total of 64 CPUs, 1TB RAM, and 5 TPUv4. One seed for a DM-HARD-8 experiment (e.g., for MEME-RECODE-CASM and MEME-NGU-CASM) took 90h to execute using multiple servers with a total of 512 CPUs, 1TB RAM, and 5 TPUv4. |
| Software Dependencies | No | The paper mentions implementing methods in a 'distributed setting' and using 'MEME' and 'VMPO-based agents', but it does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, or other library versions). |
| Experiment Setup | Yes | We also report here the precise hyper-parameter values used in our experiments: Table 1 for Atari and Table 2 for DM-HARD-8. We omit hyper-parameters which do not differ from the base MEME agent (Kapturowski et al., 2022). |
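
The Research Type row describes RECODE as estimating visitation counts for clusters of states that are close in an embedding space. The sketch below illustrates that general idea only; it is not the paper's Algorithm 1 or Algorithm 2. The class name `ClusterCounter`, the fixed `radius` merge rule, the `decay` factor, the lowest-count eviction rule, and the `1 / sqrt(count)` bonus are all assumptions chosen for illustration.

```python
import math


class ClusterCounter:
    """Toy streaming cluster counter (hypothetical; not the paper's algorithm)."""

    def __init__(self, radius, max_clusters, decay=1.0):
        self.radius = radius              # assumed merge distance in embedding space
        self.max_clusters = max_clusters  # assumed fixed memory budget
        self.decay = decay                # assumed discount applied to old counts
        self.centers = []
        self.counts = []

    def update(self, x):
        """Record one visit at embedding x and return a novelty bonus."""
        # Discount existing counts so rarely revisited regions regain novelty.
        self.counts = [c * self.decay for c in self.counts]

        nearest, nearest_dist = None, float("inf")
        for idx, center in enumerate(self.centers):
            d = math.dist(x, center)
            if d < nearest_dist:
                nearest, nearest_dist = idx, d

        if nearest is not None and nearest_dist <= self.radius:
            # Close enough: count the visit and pull the center toward x.
            self.counts[nearest] += 1.0
            w = 1.0 / self.counts[nearest]
            self.centers[nearest] = [
                c + w * (xi - c) for c, xi in zip(self.centers[nearest], x)
            ]
        else:
            # Far from every cluster: open a new one, evicting the
            # lowest-count cluster if the budget is exhausted.
            if len(self.centers) >= self.max_clusters:
                j = min(range(len(self.counts)), key=self.counts.__getitem__)
                del self.centers[j]
                del self.counts[j]
            self.centers.append(list(x))
            self.counts.append(1.0)
            nearest = len(self.centers) - 1

        # Count-based bonus: rarely visited clusters yield larger rewards.
        return 1.0 / math.sqrt(self.counts[nearest])


counter = ClusterCounter(radius=0.5, max_clusters=2)
for embedding in ([0.0, 0.0], [0.1, 0.0], [5.0, 5.0]):
    print(embedding, counter.update(embedding))
```

In this toy version a repeated visit to the same neighborhood lowers the bonus, while a distant embedding opens a new cluster and receives the full bonus. The method described in the paper would apply such counts to learned embeddings and may differ in every detail listed as an assumption above.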