Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Exploration in Reinforcement Learning with Deep Covering Options
Authors: Yuu Jinnai, Jee Won Park, Marlos C. Machado, George Konidaris
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward. |
| Researcher Affiliation | Collaboration | Yuu Jinnai Brown University Jee Won Park Brown University Marlos C. Machado Google Brain George Konidaris Brown University |
| Pseudocode | Yes | Algorithm 1 Deep covering options |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the release of its source code. |
| Open Datasets | Yes | Pinball In the Pinball domain the goal is to maneuver a small ball from a start state to a goal state (Figure 3a; Konidaris and Barto, 2009). MuJoCo control tasks (Todorov et al., 2012). three Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes training procedures and sampling trajectories but does not specify explicit train/validation/test dataset splits with percentages or sample counts. For example, 'We evaluated with the threshold percentile k = {5, 10, 30, 50} and selected 30 as it performed the best (Table 1).' describes parameter tuning, not a dataset split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components and algorithms (e.g., Q-learning, Fourier basis, DIAYN, DDPG, Double Deep Q-learning, Adam optimizer, MuJoCo, Arcade Learning Environment) but does not provide specific version numbers for these software dependencies or libraries. |
| Experiment Setup | Yes | We used Q-learning (α = 0.1, γ = 0.99, ϵ = 0.05). We set the percentile to k = 30. We set the Lagrange multiplier η to 1.0. The actor and the critic are implemented with 3 hidden layers with 256 units each followed by a ReLU activation function with Adam optimizer with a step size of 0.005. We used neural networks for the actor and the critic as it outperformed an agent with the actor and the critic implemented by a linear approximator using 3rd order Fourier basis Konidaris et al. (2011). Our discriminator network consists of 2 hidden layers with 256 units each followed by a ReLU activation function. We trained the discriminator with Adam optimizer with a step size of 0.001. We set the threshold percentile k = 10. The Q-network consists of two fully connected layers with 400 units with a batch normalization and a ReLU in between. We trained it with the Adam optimizer using a step size of 0.0001 and a batch size of 64. We updated a target policy every step by update rate of 0.001. We set ϵ to 0.05. We set the threshold percentile k = 4. We set the Lagrange multiplier η to 1.0. The eigenfunction is learned with a convolutional neural network with 2 convolution layers (32 8x8 filters with stride 4 and 64 4x4 filters with stride 2) and a fully-connected hidden layer with 400 units and ReLU in between. We trained it with Adam optimizer with a step size of 0.0001 and a batch size of 64. We updated a target policy every step by update rate of 0.001. We set ϵ to 0.05. |
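
To make the quoted Q-learning settings concrete, the following is a minimal sketch of a tabular Q-learning step using the values reported above (α = 0.1, γ = 0.99, ϵ = 0.05), plus a generic nearest-rank percentile helper of the kind a "threshold percentile k" implies. It is not the authors' code, which was not released. The function names, the dictionary-based Q-table, and the nearest-rank percentile rule are all assumptions made for illustration; the table does not say how the paper applies the percentile.

```python
import random
from collections import defaultdict

# Hyperparameters quoted in the Experiment Setup row.
ALPHA = 0.1     # learning rate
GAMMA = 0.99    # discount factor
EPSILON = 0.05  # exploration rate


def epsilon_greedy(q, state, actions, rng=random):
    """Pick a random action with probability EPSILON, else a greedy one."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, done):
    """Apply one standard tabular Q-learning update in place."""
    if done:
        target = reward
    else:
        target = reward + GAMMA * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += ALPHA * (target - q[(state, action)])


def percentile_threshold(values, k):
    """Return the k-th percentile of `values` (nearest-rank rule, integer k).

    Hypothetical helper: the nearest-rank rule is an assumption, not
    something stated in the paper excerpt.
    """
    ordered = sorted(values)
    # Integer ceiling of k * n / 100 avoids floating-point rounding issues.
    rank = max(1, (k * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
    actions = ["left", "right"]
    q_update(q, state=0, action="right", reward=1.0,
             next_state=1, actions=actions, done=True)
    print(q[(0, "right")])
    print(percentile_threshold(list(range(1, 101)), 30))
```

The percentile helper is kept separate from the update rule so that the percentile values reported in the table (30, 10, 4) can be tried on any list of per-state scores without touching the learning loop.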