BYOL-Explore: Exploration by Bootstrapped Prediction

Authors: Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, Bilal Piot

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
Researcher Affiliation | Industry | Zhaohan Daniel Guo (DeepMind, danielguo@deepmind.com); Shantanu Thakoor (DeepMind); Miruna Pîslar (DeepMind); Bernardo Avila Pires (DeepMind); Florent Altché (DeepMind); Corentin Tallec (DeepMind); Alaa Saade (DeepMind); Daniele Calandriello (DeepMind); Jean-Bastien Grill (DeepMind); Yunhao Tang (DeepMind); Michal Valko (DeepMind); Rémi Munos (DeepMind); Mohammad Gheshlaghi Azar (DeepMind, mazar@deepmind.com); Bilal Piot (DeepMind, piot@deepmind.com)
Pseudocode | No | The paper includes a neural architecture diagram (Figure 1) but no pseudocode or algorithm block.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code is proprietary
Open Datasets | Yes | Atari Learning Environment [6]. This is a widely used RL benchmark... Hard-Eight Suite [22]. This benchmark comprises 8 hard exploration tasks
Dataset Splits | No | The paper evaluates on RL benchmark suites (Atari, DM-HARD-8), which are interactive environments rather than traditional static datasets with explicit train/validation/test splits. It describes training within these environments and evaluating performance, but does not specify split percentages or counts for any validation set.
Hardware Specification | Yes | All experiments were run on a single machine with 16 NVIDIA A100 GPUs.
Software Dependencies | No | All models are implemented in JAX [1] and Haiku [2], and trained using Optax [3] for optimization. The paper mentions software tools (JAX, Haiku, Optax) but does not provide specific version numbers for them.
Experiment Setup | Yes | At a high level, BYOL-Explore has 4 main hyper-parameters: the target network EMA parameter α, the open-loop horizon K, choosing to clip rewards and to share the BYOL-Explore representation with the RL network. ... Further details regarding the RL algorithm setup and hyperparameters are provided in Appendix C.
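To make the role of the EMA parameter α concrete: in BYOL-style methods the target network is an exponential moving average of the online network, and the world-model's prediction error against the target embedding is used as the intrinsic reward. The sketch below is a minimal NumPy illustration of those two pieces only; the function names are hypothetical, and the normalized-L2 loss form is taken from the BYOL family of methods rather than from this page, which does not reproduce the equations.

```python
import numpy as np

def ema_update(target_params, online_params, alpha):
    # EMA (Polyak) update of the target network, controlled by alpha:
    #   target <- alpha * target + (1 - alpha) * online
    # Larger alpha means a slower-moving, more stable target.
    return {name: alpha * target_params[name] + (1.0 - alpha) * online_params[name]
            for name in target_params}

def byol_intrinsic_reward(prediction, target_embedding):
    # Squared L2 distance between the normalized open-loop prediction and
    # the normalized (stop-gradient) target embedding. In BYOL-Explore the
    # magnitude of this world-model "surprise" is the intrinsic reward.
    p = prediction / np.linalg.norm(prediction)
    z = target_embedding / np.linalg.norm(target_embedding)
    return float(np.sum((p - z) ** 2))
```

A perfectly predicted embedding yields zero intrinsic reward, so well-modeled states stop attracting the agent, while novel states (large prediction error) keep generating exploration bonus.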