Dead-ends and Secure Exploration in Reinforcement Learning

Authors: Mehdi Fatemi, Shikhar Sharma, Harm van Seijen, Samira Ebrahimi Kahou

ICML 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we empirically compare secure random-walk with standard benchmarks in two sets of experiments including the Atari game of Montezuma's Revenge." |
| Researcher Affiliation | Collaboration | "1 Microsoft Research, 2000 McGill College Avenue, Suite 550, Montréal, QC H3A 3H3, Canada; 2 McGill University, 845 Sherbrooke Street West, Montréal, QC H3A 0G4, Canada." |
| Pseudocode | Yes | "Algorithm 1: Q-learning with secure random-walk." (A hedged sketch of such a scheme follows this table.) |
| Open Source Code | Yes | "Code is available at https://github.com/Maluuba/srw." |
| Open Datasets | Yes | "Surprisingly, several Atari 2600 games in the ALE suite (Bellemare et al., 2013), which look nearly unsolvable using DQN and other similar methods, are environments that indeed suffer from the bridge effect. In specific, at the bottom of the score list in (Mnih et al., 2015), 5 out of 9 games may receive better results by using secure random-walk exploration. Most notably is of course Montezuma's Revenge." |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits. It describes experiments in reinforcement learning environments (the Bridge game and Montezuma's Revenge) where the agent interacts with the environment, rather than splitting a static dataset into distinct sets. |
| Hardware Specification | No | The paper mentions 'GPU clusters' as the hardware used ('enabled us to use the GPU clusters'), but it does not provide specific details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper refers to algorithms and frameworks such as DQN and Q-learning but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | "To ensure stability, a small enough step-size has to be used due to stochasticity of the environment. We use α = 0.1, 0.01, and 0.001 for Boltzmann, count-based, and ϵ-greedy, respectively, all without annealing." |
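The Pseudocode row above refers to the paper's Algorithm 1 (Q-learning with secure random-walk). As a rough illustration only, the following is a minimal tabular sketch of what such a scheme could look like. It assumes (i) a gym-style environment whose `info` dict exposes a `dead_end` flag on undesired terminations, (ii) a second value function `Q_D` trained with reward -1 on those terminations, and (iii) an exploration policy whose action probabilities are capped by `1 + Q_D(s, a)`; these details are assumptions for this sketch rather than quotes from the excerpts above, and Algorithm 1 in the paper together with the released code (https://github.com/Maluuba/srw) is the authoritative reference.

```python
import numpy as np
from collections import defaultdict

def secure_random_walk(q_d_row, eps=1e-8):
    """Uniform random walk capped by an assumed security condition pi(a) <= 1 + Q_D(s, a)."""
    caps = np.clip(1.0 + q_d_row, 0.0, 1.0)        # Q_D in [-1, 0]  =>  caps in [0, 1]
    n = len(caps)
    probs = np.minimum(np.full(n, 1.0 / n), caps)   # start uniform, then apply the caps
    if probs.sum() < eps:                           # every action looks insecure:
        probs = np.full(n, 1.0 / n)                 # arbitrary fallback to uniform (sketch only)
    return probs / probs.sum()

def q_learning_srw(env, n_actions, episodes=500, alpha=0.001, gamma=0.99):
    """Tabular Q-learning with a secure random-walk behaviour policy (hedged sketch)."""
    Q   = defaultdict(lambda: np.zeros(n_actions))  # task value function
    Q_D = defaultdict(lambda: np.zeros(n_actions))  # dead-end ("exploration") value function

    for _ in range(episodes):
        s, done = env.reset(), False                # states assumed hashable (tabular setting)
        while not done:
            a = np.random.choice(n_actions, p=secure_random_walk(Q_D[s]))
            s2, r, done, info = env.step(a)         # classic 4-tuple gym API assumed

            # Dead-end values: reward -1 on an undesired termination, 0 otherwise,
            # with an undiscounted Bellman update (assumed here for the sketch).
            r_d = -1.0 if (done and info.get("dead_end", False)) else 0.0
            target_d = r_d + (0.0 if done else np.max(Q_D[s2]))
            Q_D[s][a] += alpha * (target_d - Q_D[s][a])

            # Standard Q-learning update for the task reward.
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q, Q_D
```

The uniform fallback when all actions appear insecure is a choice made purely to keep the sketch runnable; how (and whether) the paper handles that case, and the exact form of the security condition, should be checked against Algorithm 1 and the released code.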