Temporally-Extended ε-Greedy Exploration
Authors: Will Dabney, Georg Ostrovski, André Barreto
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present empirical results on tabular, linear, and deep RL settings, pursuing two objectives: The first is to demonstrate the generality of our method in applying it across domains as well as across multiple value-based reinforcement learning algorithms (Q-learning, SARSA, Rainbow, R2D2). |
| Researcher Affiliation | Industry | Will Dabney, Georg Ostrovski & André Barreto, DeepMind, London, UK {wdabney,ostrovski,andrebarreto}@google.com |
| Pseudocode | Yes | Algorithm 1 ϵz-Greedy exploration policy (a minimal sketch of this policy is given after the table) |
| Open Source Code | No | The paper does not provide any specific links or statements about releasing the source code for the described methodology. |
| Open Datasets | Yes | Atari-57: Deep RL Motivated by the results in tabular and linear settings, we now turn to deep RL and evaluate performance on 57 Atari 2600 games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013). |
| Dataset Splits | No | The paper mentions evaluation phases during training (e.g., 'every 1M environment frames learning is frozen and the agent is evaluated for 500K environment frames') but does not explicitly provide quantitative training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits for reproducibility). |
| Hardware Specification | Yes | Rainbow-based agents were implemented in Python using JAX, with each configuration (game, algorithm, hyper-parameter setting) run on a single V100 GPU. |
| Software Dependencies | No | The paper mentions software like Python, JAX, and TensorFlow but does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | Unless stated otherwise, hyper-parameters for our Rainbow-based agents follow the original implementation in Hessel et al. (2018), see Table 2. An exception is the Rainbow-CTS agent, which uses a regular dueling value network instead of the Noisy Nets variant, and also makes use of an ϵ-greedy policy (whereas the baseline Rainbow relies on its Noisy Nets value head for exploration). The ϵ parameter follows a linear decay schedule from 1.0 to 0.01 over the course of the first 4M frames, remaining constant after that. Evaluation happens with an even lower value of ϵ = 0.001. (A sketch of this schedule is given below the table.) |
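
The Pseudocode row above quotes the title of Algorithm 1, the ϵz-greedy exploration policy. As a rough illustration of how such a policy operates, the following is a minimal sketch, assuming a zeta (Zipf) duration distribution with exponent `mu` and a hypothetical `q_values` callable; none of these names come from the paper's own code, which is not released.

```python
import numpy as np

def make_ez_greedy_policy(q_values, num_actions, epsilon=0.01, mu=2.0, rng=None):
    """Minimal sketch of an epsilon-z-greedy action selector (not the authors' code).

    q_values: hypothetical callable mapping an observation to an array of Q-values.
    Exploration durations are drawn from a zeta (Zipf) distribution with exponent mu,
    one heavy-tailed choice of the duration distribution z discussed in the paper.
    """
    rng = rng or np.random.default_rng()
    state = {"n": 0, "w": 0}  # remaining repeat steps and the action being repeated

    def act(observation):
        if state["n"] > 0:
            # Continue the current temporally-extended random action.
            state["n"] -= 1
            return state["w"]
        if rng.random() < epsilon:
            # Start a new exploratory repeat: sample a duration and a random action.
            state["n"] = int(rng.zipf(mu)) - 1  # this step counts as the first repeat
            state["w"] = int(rng.integers(num_actions))
            return state["w"]
        # Otherwise act greedily with respect to the current value estimates.
        return int(np.argmax(q_values(observation)))

    return act
```

With `epsilon = 0`, this reduces to the greedy policy; with durations fixed at 1, it reduces to standard ϵ-greedy.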
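The Experiment Setup row describes a linear ϵ decay from 1.0 to 0.01 over the first 4M environment frames, constant thereafter, with ϵ = 0.001 during evaluation. A small sketch of that schedule, using only the numbers stated in the paper (the function name and structure are illustrative):

```python
def epsilon_schedule(frame, start=1.0, end=0.01, decay_frames=4_000_000):
    """Linear epsilon decay: 1.0 -> 0.01 over the first 4M frames, then constant."""
    fraction = min(frame / decay_frames, 1.0)
    return start + fraction * (end - start)

EVAL_EPSILON = 0.001  # lower epsilon used during evaluation phases

# Example values: epsilon_schedule(0) == 1.0,
# epsilon_schedule(2_000_000) == 0.505, epsilon_schedule(4_000_000) == 0.01.
```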