Exploration in Reinforcement Learning with Deep Covering Options
Authors: Yuu Jinnai, Jee Won Park, Marlos C. Machado, George Konidaris
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward. |
| Researcher Affiliation | Collaboration | Yuu Jinnai (Brown University); Jee Won Park (Brown University); Marlos C. Machado (Google Brain); George Konidaris (Brown University) |
| Pseudocode | Yes | Algorithm 1 Deep covering options |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the release of its source code. |
| Open Datasets | Yes | Pinball: in the Pinball domain the goal is to maneuver a small ball from a start state to a goal state (Figure 3a; Konidaris and Barto, 2009); MuJoCo control tasks (Todorov et al., 2012); three Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes training procedures and sampling trajectories but does not specify explicit train/validation/test dataset splits with percentages or sample counts. For example, 'We evaluated with the threshold percentile k = {5, 10, 30, 50} and selected 30 as it performed the best (Table 1).' describes parameter tuning, not a dataset split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components and algorithms (e.g., Q-learning, Fourier basis, DIAYN, DDPG, Double Deep Q-learning, Adam optimizer, MuJoCo, Arcade Learning Environment) but does not provide specific version numbers for these software dependencies or libraries. |
| Experiment Setup | Yes | We used Q-learning (α = 0.1, γ = 0.99, ϵ = 0.05). We set the percentile to k = 30. We set the Lagrange multiplier η to 1.0. The actor and the critic are implemented with 3 hidden layers of 256 units each, followed by a ReLU activation function, and trained with the Adam optimizer with a step size of 0.005. We used neural networks for the actor and the critic as this outperformed an agent with the actor and the critic implemented by a linear approximator using a 3rd-order Fourier basis (Konidaris et al., 2011). Our discriminator network consists of 2 hidden layers with 256 units each followed by a ReLU activation function. We trained the discriminator with the Adam optimizer with a step size of 0.001. We set the threshold percentile k = 10. The Q-network consists of two fully connected layers with 400 units, with batch normalization and a ReLU in between. We trained it with the Adam optimizer using a step size of 0.0001 and a batch size of 64. We updated the target policy every step with an update rate of 0.001. We set ϵ to 0.05. We set the threshold percentile k = 4. We set the Lagrange multiplier η to 1.0. The eigenfunction is learned with a convolutional neural network with 2 convolutional layers (32 8x8 filters with stride 4 and 64 4x4 filters with stride 2) and a fully connected hidden layer with 400 units, with ReLU in between. We trained it with the Adam optimizer with a step size of 0.0001 and a batch size of 64. We updated the target policy every step with an update rate of 0.001. We set ϵ to 0.05. |
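The threshold percentile values (k = 30, 10, 4) and the Lagrange multiplier η quoted in the Experiment Setup row belong to the covering-option construction of the paper's Algorithm 1 (the "Pseudocode" row above). The Python sketch below illustrates that construction for a single option; it is not a transcription of Algorithm 1, and the exact initiation set, termination set, and intrinsic reward shown here are assumptions based on the covering-options idea of steering the agent toward low-eigenfunction, less explored states. The eigenfunction `f` is assumed to be an already trained approximator of the Fiedler vector of the state-transition graph Laplacian.

```python
import numpy as np

def build_covering_option(f, states, k_percentile=30):
    """Hedged sketch of building one covering option from a learned
    eigenfunction f and a buffer of sampled states.

    f            -- callable mapping a state to a scalar eigenfunction value
    states       -- iterable of sampled states
    k_percentile -- threshold percentile k quoted in the experiment setup
    """
    values = np.array([f(s) for s in states])
    threshold = np.percentile(values, k_percentile)

    def initiation(s):
        # The option can be started from states above the threshold.
        return f(s) > threshold

    def termination(s):
        # The option terminates once a low-eigenfunction state is reached.
        return f(s) <= threshold

    def intrinsic_reward(s, s_next):
        # Reward the option policy for decreasing the eigenfunction value,
        # i.e. for moving toward the poorly connected region (assumption).
        return f(s) - f(s_next)

    return initiation, termination, intrinsic_reward
```

The option policy itself would then be trained on `intrinsic_reward` with the off-policy learner used in each domain (DDPG for MuJoCo, double DQN for Atari, per the setup quoted above).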
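For the Atari setup, the quoted eigenfunction architecture (two convolutional layers of 32 8x8 filters with stride 4 and 64 4x4 filters with stride 2, a 400-unit fully connected hidden layer, ReLU activations, Adam with a step size of 0.0001) can be written down directly. The PyTorch sketch below assumes standard 84x84, 4-frame Atari preprocessing and a scalar output head, neither of which is specified in the table.

```python
import torch
import torch.nn as nn

class EigenfunctionNet(nn.Module):
    """Sketch of the Atari eigenfunction network described in the setup row:
    2 conv layers (32 8x8, stride 4; 64 4x4, stride 2), a 400-unit fully
    connected hidden layer, ReLU in between. Input shape and the scalar
    output are assumptions."""

    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        # 64 * 9 * 9 = 5184 features for assumed 84x84 inputs.
        self.head = nn.Sequential(
            nn.Linear(64 * 9 * 9, 400),
            nn.ReLU(),
            nn.Linear(400, 1),  # scalar eigenfunction value f(s)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(obs))

# Optimizer settings quoted from the setup: Adam, step size 1e-4 (batch size 64).
model = EigenfunctionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```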