Temporally-Extended ε-Greedy Exploration

Authors: Will Dabney, Georg Ostrovski, Andre Barreto

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present empirical results on tabular, linear, and deep RL settings, pursuing two objectives: The first is to demonstrate the generality of our method in applying it across domains as well as across multiple value-based reinforcement learning algorithms (Q-learning, SARSA, Rainbow, R2D2).
Researcher Affiliation | Industry | Will Dabney, Georg Ostrovski & André Barreto, DeepMind, London, UK, {wdabney,ostrovski,andrebarreto}@google.com
Pseudocode | Yes | Algorithm 1: ϵz-greedy exploration policy (a minimal sketch of this policy is given after the table).
Open Source Code | No | The paper does not provide any links to, or statements about releasing, source code for the described methodology.
Open Datasets | Yes | Atari-57 (Deep RL): Motivated by the results in tabular and linear settings, we now turn to deep RL and evaluate performance on 57 Atari 2600 games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013).
Dataset Splits | No | The paper mentions evaluation phases during training (e.g., 'every 1M environment frames learning is frozen and the agent is evaluated for 500K environment frames') but does not provide quantitative training/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility.
Hardware Specification | Yes | Rainbow-based agents were implemented in Python using JAX, with each configuration (game, algorithm, hyper-parameter setting) run on a single V100 GPU.
Software Dependencies | No | The paper mentions software such as Python, JAX, and TensorFlow but does not give version numbers for these or any other dependencies needed for reproducibility.
Experiment Setup | Yes | Unless stated otherwise, hyper-parameters for our Rainbow-based agents follow the original implementation in Hessel et al. (2018), see Table 2. An exception is the Rainbow-CTS agent, which uses a regular dueling value network instead of the Noisy Nets variant, and also makes use of an ϵ-greedy policy (whereas the baseline Rainbow relies on its Noisy Nets value head for exploration). The ϵ parameter follows a linear decay schedule from 1.0 to 0.01 over the course of the first 4M frames, remaining constant after that. Evaluation happens with an even lower value of ϵ = 0.001. (A sketch of this decay schedule is given after the table.)
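
For orientation, the following is a minimal Python sketch of a temporally-extended ε-greedy (ϵz-greedy) actor in the spirit of Algorithm 1: with probability ε a random action and a duration n are sampled and the action is repeated for n steps; otherwise the greedy action is taken. The class and method names, the default exponent μ = 2 for the zeta (Zipf) duration distribution, and the cap on repeat length are assumptions for illustration, not taken from the paper's implementation.

```python
import numpy as np


class EpsilonZGreedy:
    """Minimal sketch of a temporally-extended eps-greedy (eps_z-greedy) actor.

    With probability epsilon, a uniformly random action is paired with a
    duration n drawn from a heavy-tailed zeta/Zipf distribution and repeated
    for n consecutive steps; otherwise the greedy action with respect to the
    current Q-values is taken. Names and defaults are illustrative assumptions.
    """

    def __init__(self, num_actions, epsilon=0.01, mu=2.0, max_repeat=1000, rng=None):
        self.num_actions = num_actions
        self.epsilon = epsilon        # exploration probability
        self.mu = mu                  # zeta exponent (assumed value)
        self.max_repeat = max_repeat  # practical cap on repeat length (assumption)
        self.rng = rng if rng is not None else np.random.default_rng()
        self.repeat_left = 0          # remaining steps of the current repeated action
        self.repeat_action = None

    def step(self, q_values):
        # Continue an ongoing temporally-extended exploratory action, if any.
        if self.repeat_left > 0:
            self.repeat_left -= 1
            return self.repeat_action
        # With probability epsilon, start a new repeated random action.
        if self.rng.random() < self.epsilon:
            self.repeat_action = int(self.rng.integers(self.num_actions))
            n = min(int(self.rng.zipf(self.mu)), self.max_repeat)
            self.repeat_left = n - 1  # this call counts as the first of the n steps
            return self.repeat_action
        # Otherwise, act greedily with respect to the current Q-values.
        return int(np.argmax(q_values))
```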
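
The ϵ decay quoted in the Experiment Setup row amounts to a simple linear annealing. The sketch below (function name and signature are assumptions) reproduces the stated values: a decay from 1.0 to 0.01 over the first 4M environment frames, held constant afterwards, with ϵ = 0.001 used during evaluation.

```python
def linear_epsilon(frame, start=1.0, end=0.01, decay_frames=4_000_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_frames`
    environment frames, then hold it constant (values from the quoted setup)."""
    fraction = min(frame / decay_frames, 1.0)
    return start + fraction * (end - start)


EVAL_EPSILON = 0.001  # fixed, lower epsilon used during evaluation phases
```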