Exploration in Reinforcement Learning with Deep Covering Options

Authors: Yuu Jinnai, Jee Won Park, Marlos C. Machado, George Konidaris

Venue: ICLR 2020

Reproducibility assessment (each entry gives the variable, the result, and the supporting LLM response):
Research Type: Experimental
  LLM Response: We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
Researcher Affiliation: Collaboration
  LLM Response: Yuu Jinnai (Brown University), Jee Won Park (Brown University), Marlos C. Machado (Google Brain), George Konidaris (Brown University)
Pseudocode: Yes
  LLM Response: Algorithm 1 (Deep covering options); a hedged sketch of this procedure follows the assessment entries below.
Open Source Code: No
  LLM Response: The paper does not provide any specific links or explicit statements about the release of its source code.
Open Datasets: Yes
  LLM Response: Pinball, where the goal is to maneuver a small ball from a start state to a goal state (Figure 3a; Konidaris and Barto, 2009); MuJoCo control tasks (Todorov et al., 2012); and three Atari games (Bellemare et al., 2013).
Dataset Splits: No
  LLM Response: The paper describes training procedures and trajectory sampling but does not specify explicit train/validation/test splits with percentages or sample counts. For example, 'We evaluated with the threshold percentile k = {5, 10, 30, 50} and selected 30 as it performed the best (Table 1)' describes parameter tuning, not a dataset split.
Hardware Specification: No
  LLM Response: The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No
  LLM Response: The paper mentions several software components and algorithms (e.g., Q-learning, Fourier basis, DIAYN, DDPG, Double Deep Q-learning, the Adam optimizer, MuJoCo, the Arcade Learning Environment) but does not provide version numbers for these software dependencies or libraries.
Experiment Setup: Yes
  LLM Response: We used Q-learning (α = 0.1, γ = 0.99, ε = 0.05). We set the percentile to k = 30. We set the Lagrange multiplier η to 1.0.
  The actor and the critic are implemented with 3 hidden layers of 256 units each, followed by a ReLU activation function, and trained with the Adam optimizer with a step size of 0.005. We used neural networks for the actor and the critic as they outperformed an agent with the actor and the critic implemented by a linear approximator using a 3rd-order Fourier basis (Konidaris et al., 2011). Our discriminator network consists of 2 hidden layers of 256 units each, followed by a ReLU activation function. We trained the discriminator with the Adam optimizer with a step size of 0.001. We set the threshold percentile k = 10.
  The Q-network consists of two fully connected layers with 400 units, with a batch normalization and a ReLU in between. We trained it with the Adam optimizer using a step size of 0.0001 and a batch size of 64. We updated the target policy every step with an update rate of 0.001. We set ε to 0.05. We set the threshold percentile k = 4. We set the Lagrange multiplier η to 1.0. The eigenfunction is learned with a convolutional neural network with 2 convolution layers (32 8x8 filters with stride 4 and 64 4x4 filters with stride 2) and a fully connected hidden layer with 400 units, with ReLU in between. We trained it with the Adam optimizer with a step size of 0.0001 and a batch size of 64. We updated the target policy every step with an update rate of 0.001. We set ε to 0.05.
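
The Pseudocode entry above refers to Algorithm 1 (deep covering options). As a rough, non-authoritative illustration of how the threshold percentile k and a learned eigenfunction could interact, here is a minimal Python sketch. The helper build_covering_option, the convention that the option terminates where the eigenfunction falls below the k-th percentile, and the intrinsic reward f(s) - f(s') are assumptions in the spirit of the covering-options family, not the paper's exact algorithm.

```python
import numpy as np

def termination_threshold(f_values, k):
    """k-th percentile of the learned eigenfunction values over sampled states."""
    return np.percentile(f_values, k)

def build_covering_option(f, sampled_states, k=30):
    """Construct initiation/termination predicates and an intrinsic reward from a
    learned eigenfunction f (percentile convention and reward are assumptions)."""
    values = np.array([f(s) for s in sampled_states])
    thresh = termination_threshold(values, k)
    initiation = lambda s: f(s) > thresh            # option may start here
    termination = lambda s: f(s) <= thresh          # poorly connected region reached
    intrinsic_reward = lambda s, s_next: f(s) - f(s_next)  # reward for decreasing f
    return initiation, termination, intrinsic_reward

# Toy usage with a stand-in eigenfunction over scalar states (illustration only).
f = lambda s: float(np.sin(s))
states = np.random.uniform(0.0, 10.0, size=1000)
init_fn, term_fn, r_fn = build_covering_option(f, states, k=30)
print(term_fn(1.5), r_fn(1.5, 4.0))
```

In practice the option policy that maximizes this intrinsic reward would be trained with the DDPG or double DQN learners mentioned in the Software Dependencies entry, but those components are omitted here.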
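
For concreteness, below is a minimal PyTorch sketch of the network shapes quoted in the Experiment Setup entry. The paper does not state its deep-learning framework, and the input/output dimensions (state dimension, number of actions, number of stacked Atari frames) and the exact placement of the ReLUs are assumptions, so treat this as an illustration of the reported layer sizes rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def actor_critic_mlp(in_dim, out_dim):
    """Actor/critic: 3 hidden layers of 256 units, each followed by ReLU."""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

def discriminator(in_dim, n_options):
    """Discriminator: 2 hidden layers of 256 units, each followed by ReLU."""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, n_options),
    )

def q_network(in_dim, n_actions):
    """Q-network: two fully connected layers with 400 units, batch
    normalization and ReLU in between (output size is assumed)."""
    return nn.Sequential(
        nn.Linear(in_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
        nn.Linear(400, n_actions),
    )

def eigenfunction_convnet(in_channels=4, n_outputs=1):
    """Eigenfunction network: conv 32 8x8 stride 4, conv 64 4x4 stride 2,
    then a fully connected hidden layer of 400 units, with ReLU in between."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(400), nn.ReLU(),  # flattened size depends on frame resolution
        nn.Linear(400, n_outputs),
    )

# Optimizers with the step sizes quoted above (dimensions are illustrative).
q_net = q_network(in_dim=8, n_actions=4)
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
eig_net = eigenfunction_convnet(in_channels=4)
eig_opt = torch.optim.Adam(eig_net.parameters(), lr=1e-4)
```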