Dead-ends and Secure Exploration in Reinforcement Learning
Authors: Mehdi Fatemi, Shikhar Sharma, Harm van Seijen, Samira Ebrahimi Kahou
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically compare secure random-walk with standard benchmarks in two sets of experiments including the Atari game of Montezuma's Revenge. |
| Researcher Affiliation | Collaboration | ¹Microsoft Research, 2000 McGill College Avenue, Suite 550, Montréal, QC H3A 3H3, Canada; ²McGill University, 845 Sherbrooke Street West, Montréal, QC H3A 0G4, Canada. |
| Pseudocode | Yes | Algorithm 1: Q-learning with secure random-walk (see the illustrative sketch after this table). |
| Open Source Code | Yes | Code is available at https://github.com/Maluuba/srw. |
| Open Datasets | Yes | Surprisingly, several Atari 2600 games in the ALE suite (Bellemare et al., 2013), which look nearly unsolvable using DQN and similar methods, are environments that indeed suffer from the bridge effect. Specifically, at the bottom of the score list in (Mnih et al., 2015), 5 out of 9 games may receive better results by using secure random-walk exploration. Most notable, of course, is Montezuma's Revenge. |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits. It describes experiments conducted in reinforcement learning environments (Bridge game, Montezuma's Revenge) where the agent interacts with the environment, rather than splitting a static dataset into distinct sets. |
| Hardware Specification | No | The paper mentions 'GPU clusters' as the hardware used ('enabled us to use the GPU clusters'), but it does not provide specific details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper refers to algorithms and frameworks like 'DQN' and 'Q-learning' but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | To ensure stability, a small enough step-size has to be used due to stochasticity of the environment. We use α = 0.1, 0.01, and 0.001 for Boltzmann, count-based, and ϵ-greedy, respectively, all without annealing. |
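
The pseudocode row above refers to the paper's Algorithm 1 (Q-learning with secure random-walk). Below is a minimal, illustrative sketch of that idea, not the authors' implementation: it assumes a tabular environment with a Gym-style `reset()`/`step()` interface returning a 4-tuple, a hypothetical `dead_end` flag in the step `info`, and a security rule of the form π(a|s) ≤ 1 + Q_D(s, a), where Q_D is an auxiliary exploration value trained with reward −1 on transitions into a dead-end and γ = 1.

```python
# Illustrative sketch only -- NOT the authors' Algorithm 1.
# Assumptions: env has Gym-style reset()/step(a) returning (s, r, done, info),
# states are hashable, and info carries a hypothetical "dead_end" flag.
import numpy as np
from collections import defaultdict


def secure_random_walk_q_learning(env, n_actions, episodes=500,
                                  alpha=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))    # task value estimates
    Q_D = defaultdict(lambda: np.zeros(n_actions))  # dead-end (exploration) estimates in [-1, 0]

    def secure_policy(s):
        # Start from a uniform random walk, then cap each action's probability
        # by 1 + Q_D(s, a): the closer Q_D is to -1, the less that action is allowed.
        caps = np.clip(1.0 + Q_D[s], 0.0, 1.0)
        probs = np.minimum(np.full(n_actions, 1.0 / n_actions), caps)
        total = probs.sum()
        if total <= 0.0:  # every action looks insecure: fall back to uniform
            return np.full(n_actions, 1.0 / n_actions)
        return probs / total

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=secure_policy(s))
            s_next, r, done, info = env.step(a)

            # Task head: ordinary Q-learning update.
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s][a] += alpha * (target - Q[s][a])

            # Exploration head: reward -1 only on transitions flagged as entering
            # a dead-end, discount 1, so Q_D tracks (minus) dead-end reachability.
            r_d = -1.0 if info.get("dead_end", False) else 0.0
            target_d = r_d + (0.0 if done else Q_D[s_next].max())
            Q_D[s][a] += alpha * (target_d - Q_D[s][a])

            s = s_next
    return Q, Q_D
```

The step size `alpha=0.1` here simply mirrors one of the values quoted in the experiment-setup row; note that the secure cap only constrains the exploration distribution, so the task-value update itself is unchanged from standard Q-learning.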