Dead-ends and Secure Exploration in Reinforcement Learning

Authors: Mehdi Fatemi, Shikhar Sharma, Harm van Seijen, Samira Ebrahimi Kahou

ICML 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we empirically compare secure random-walk with standard benchmarks in two sets of experiments including the Atari game of Montezuma's Revenge." |
| Researcher Affiliation | Collaboration | "1 Microsoft Research, 2000 McGill College Avenue, Suite 550, Montréal, QC H3A 3H3, Canada; 2 McGill University, 845 Sherbrooke Street West, Montréal, QC H3A 0G4, Canada." |
| Pseudocode | Yes | "Algorithm 1: Q-learning with secure random-walk." (A hedged sketch of such a scheme follows this table.) |
| Open Source Code | Yes | "Code is available at https://github.com/Maluuba/srw." |
| Open Datasets | Yes | "Surprisingly, several Atari 2600 games in the ALE suite (Bellemare et al., 2013), which look nearly unsolvable using DQN and other similar methods, are environments that indeed suffer from the bridge effect. In specific, at the bottom of the score list in (Mnih et al., 2015), 5 out of 9 games may receive better results by using secure random-walk exploration. Most notably is of course Montezuma's Revenge." |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits. It describes experiments in reinforcement learning environments (the Bridge game and Montezuma's Revenge) where the agent interacts with the environment, rather than splitting a static dataset into distinct sets. |
| Hardware Specification | No | The paper mentions 'GPU clusters' as the hardware used ('enabled us to use the GPU clusters'), but it does not provide specific details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper refers to algorithms and frameworks such as DQN and Q-learning but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | "To ensure stability, a small enough step-size has to be used due to stochasticity of the environment. We use α = 0.1, 0.01, and 0.001 for Boltzmann, count-based, and ϵ-greedy, respectively, all without annealing." |
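The Pseudocode row above refers to the paper's Algorithm 1 (Q-learning with secure random-walk). As a rough illustration only, the following is a minimal tabular sketch of what such a scheme could look like. It assumes (i) a gym-style environment whose `info` dict exposes a `dead_end` flag on undesired terminations, (ii) a second value function `Q_D` trained with reward -1 on those terminations, and (iii) an exploration policy whose action probabilities are capped by `1 + Q_D(s, a)`; these details are assumptions for this sketch rather than quotes from the excerpts above, and Algorithm 1 in the paper together with the released code (https://github.com/Maluuba/srw) is the authoritative reference.

```python
import numpy as np
from collections import defaultdict

def secure_random_walk(q_d_row, eps=1e-8):
    """Uniform random walk capped by an assumed security condition pi(a) <= 1 + Q_D(s, a)."""
    caps = np.clip(1.0 + q_d_row, 0.0, 1.0)        # Q_D in [-1, 0]  =>  caps in [0, 1]
    n = len(caps)
    probs = np.minimum(np.full(n, 1.0 / n), caps)   # start uniform, then apply the caps
    if probs.sum() < eps:                           # every action looks insecure:
        probs = np.full(n, 1.0 / n)                 # arbitrary fallback to uniform (sketch only)
    return probs / probs.sum()

def q_learning_srw(env, n_actions, episodes=500, alpha=0.001, gamma=0.99):
    """Tabular Q-learning with a secure random-walk behaviour policy (hedged sketch)."""
    Q   = defaultdict(lambda: np.zeros(n_actions))  # task value function
    Q_D = defaultdict(lambda: np.zeros(n_actions))  # dead-end ("exploration") value function

    for _ in range(episodes):
        s, done = env.reset(), False                # states assumed hashable (tabular setting)
        while not done:
            a = np.random.choice(n_actions, p=secure_random_walk(Q_D[s]))
            s2, r, done, info = env.step(a)         # classic 4-tuple gym API assumed

            # Dead-end values: reward -1 on an undesired termination, 0 otherwise,
            # with an undiscounted Bellman update (assumed here for the sketch).
            r_d = -1.0 if (done and info.get("dead_end", False)) else 0.0
            target_d = r_d + (0.0 if done else np.max(Q_D[s2]))
            Q_D[s][a] += alpha * (target_d - Q_D[s][a])

            # Standard Q-learning update for the task reward.
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q, Q_D
```

The uniform fallback when all actions appear insecure is a choice made purely to keep the sketch runnable; how (and whether) the paper handles that case, and the exact form of the security condition, should be checked against Algorithm 1 and the released code.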