reproducibilityindex.ai

Exploration in Reinforcement Learning with Deep Covering Options

Authors: Yuu Jinnai, Jee Won Park, Marlos C. Machado, George Konidaris

ICLR 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
Researcher Affiliation	Collaboration	Yuu Jinnai (Brown University); Jee Won Park (Brown University); Marlos C. Machado (Google Brain); George Konidaris (Brown University)
Pseudocode	Yes	Algorithm 1 Deep covering options
Open Source Code	No	The paper does not provide any specific links or explicit statements about the release of its source code.
Open Datasets	Yes	Pinball: In the Pinball domain, the goal is to maneuver a small ball from a start state to a goal state (Figure 3a; Konidaris and Barto, 2009). MuJoCo control tasks (Todorov et al., 2012). Three Atari games (Bellemare et al., 2013).
Dataset Splits	No	The paper describes training procedures and sampling trajectories but does not specify explicit train/validation/test dataset splits with percentages or sample counts. For example, 'We evaluated with the threshold percentile k = {5, 10, 30, 50} and selected 30 as it performed the best (Table 1).' describes parameter tuning, not a dataset split.
Hardware Specification	No	The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies	No	The paper mentions several software components and algorithms (e.g., Q-learning, Fourier basis, DIAYN, DDPG, Double Deep Q-learning, Adam optimizer, MuJoCo, Arcade Learning Environment) but does not provide specific version numbers for these software dependencies or libraries.
Experiment Setup	Yes	We used Q-learning (α = 0.1, γ = 0.99, ϵ = 0.05). We set the percentile to k = 30. We set the Lagrange multiplier η to 1.0. The actor and the critic are implemented with 3 hidden layers with 256 units each followed by a ReLU activation function with Adam optimizer with a step size of 0.005. We used neural networks for the actor and the critic as it outperformed an agent with the actor and the critic implemented by a linear approximator using 3rd order Fourier basis (Konidaris et al., 2011). Our discriminator network consists of 2 hidden layers with 256 units each followed by a ReLU activation function. We trained the discriminator with Adam optimizer with a step size of 0.001. We set the threshold percentile k = 10. The Q-network consists of two fully connected layers with 400 units with a batch normalization and a ReLU in between. We trained it with the Adam optimizer using a step size of 0.0001 and a batch size of 64. We updated a target policy every step by update rate of 0.001. We set ϵ to 0.05. We set the threshold percentile k = 4. We set the Lagrange multiplier η to 1.0. The eigenfunction is learned with a convolutional neural network with 2 convolution layers (32 8x8 filters with stride 4 and 64 4x4 filters with stride 2) and a fully-connected hidden layer with 400 units and ReLU in between. We trained it with Adam optimizer with a step size of 0.0001 and a batch size of 64. We updated a target policy every step by update rate of 0.001. We set ϵ to 0.05.
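
The Experiment Setup row quotes Q-learning with α = 0.1, γ = 0.99, ϵ = 0.05. As a rough illustration of what those three numbers control, here is a minimal sketch of one ϵ-greedy Q-learning step. The tabular (list-of-lists) representation and the helper names `epsilon_greedy` and `q_update` are assumptions made for this example. The paper releases no source code, so this is not the authors' implementation.

```python
import random

# Hyperparameters as quoted in the Experiment Setup row.
ALPHA = 0.1     # learning rate (step size of the Q-value update)
GAMMA = 0.99    # discount factor
EPSILON = 0.05  # probability of taking a uniformly random action


def epsilon_greedy(q_row, epsilon=EPSILON, rng=random):
    """Return a random action with probability epsilon, else a greedy one.

    q_row is the list of Q-values for the current state (hypothetical
    tabular layout: one entry per action).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))  # first action with the highest value


def q_update(q, s, a, r, s_next, done, alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning backup in place and return the TD error.

    Target is r for terminal transitions, else r + gamma * max_a' Q(s', a').
    """
    target = r if done else r + gamma * max(q[s_next])
    td_error = target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


if __name__ == "__main__":
    # Toy table with 2 states and 2 actions, purely illustrative.
    q = [[0.0, 0.0], [0.0, 1.0]]
    action = epsilon_greedy(q[0])
    q_update(q, s=0, a=action, r=0.0, s_next=1, done=False)
    print(q[0])
```

A small ϵ such as 0.05 means the agent acts greedily most of the time, which is one reason exploration is hard in sparse-reward tasks like those the paper evaluates.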