Discovering Options for Exploration by Minimizing Cover Time

Authors: Yuu Jinnai, Jee Won Park, David Abel, George Konidaris

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We show empirically that the proposed algorithm improves learning in several domains with sparse rewards." |
| Researcher Affiliation | Academia | "1Brown University, Providence, RI, United States. Correspondence to: Yuu Jinnai <yuu_jinnai@brown.edu>." |
| Pseudocode | Yes | "1. Compute the second smallest eigenvalue and its corresponding eigenvector (i.e., the Fiedler vector) of the Laplacian of the state-transition graph G. 2. Let v_i and v_j be the states with the largest and smallest values in the eigenvector, respectively. Generate two point options: one with I = {v_i} and β = {v_j}, and the other with I = {v_j} and β = {v_i}. Each option policy is the optimal path from the initial state to the termination state. 3. Set G ← G ∪ {(v_i, v_j)} and repeat the process until the number of options reaches k." (A code sketch of this procedure appears below the table.) |
| Open Source Code | No | The paper does not provide any links to source code or explicitly state that the code for the methodology is publicly available. |
| Open Datasets | Yes | "We used six MDPs in our empirical study: a 9x9 grid, a four-room gridworld, Taxi, Towers of Hanoi, Parr's maze, and Race Track." and "Parr's maze (Parr & Russell, 1998) and Taxi (Dietterich, 2000)". These are well-known, established benchmarks in RL, implying public availability through their foundational papers. |
| Dataset Splits | No | The paper uses reinforcement learning environments in which data is generated through interaction rather than pre-defined datasets, so explicit train/validation/test splits are not applicable in the usual sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used to run its experiments. |
| Software Dependencies | No | The paper mentions using Q-learning with specific hyperparameters but does not list any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | "We used Q-learning (Watkins & Dayan, 1992) (α = 0.1, γ = 0.95) for 100 episodes, 100 steps for the 9x9 grid, 500 steps for four-room, Hanoi, and Taxi." and "The agents learned for 100 episodes, and episodes were 10,000 steps long for Parr's maze and 100 steps for the Towers of Hanoi and Taxi. We used Q-learning (Watkins & Dayan, 1992) (α = 0.1, γ = 0.95)." (See the Q-learning sketch below the table.) |
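The pseudocode row describes the covering-options construction: compute the Fiedler vector of the state-transition graph's Laplacian, create two point options between the states with the largest and smallest Fiedler-vector values, add the corresponding edge to the graph, and repeat until k options exist. The snippet below is a minimal sketch of that loop, assuming the graph is given as a symmetric NumPy adjacency matrix; the function names (`fiedler_option_endpoints`, `add_covering_options`) are illustrative rather than taken from the authors' code, and the option policies (shortest paths between the endpoints) are omitted.

```python
# Sketch of the covering-options step, assuming an undirected, connected
# state-transition graph given as a symmetric 0/1 adjacency matrix `adj`.
import numpy as np


def fiedler_option_endpoints(adj):
    """Return the states (i, j) with the largest and smallest values in the
    Fiedler vector of the graph Laplacian L = D - A."""
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    # eigh returns eigenvalues in ascending order for a symmetric matrix,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return int(np.argmax(fiedler)), int(np.argmin(fiedler))


def add_covering_options(adj, k):
    """Repeatedly connect the Fiedler-vector extremes until k point options
    (represented here only by their endpoint pairs) have been generated."""
    adj = adj.copy()
    options = []
    while len(options) < k:
        i, j = fiedler_option_endpoints(adj)
        # Two point options: one from v_i to v_j and one from v_j to v_i.
        options.append((i, j))
        options.append((j, i))
        # Set G <- G + {(v_i, v_j)} and repeat on the augmented graph.
        adj[i, j] = adj[j, i] = 1
    return options[:k]
```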
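The experiment-setup row quotes tabular Q-learning with α = 0.1 and γ = 0.95 over 100 episodes. Below is a minimal sketch of such an agent; the Gym-style `env.reset`/`env.step` interface and the ε-greedy exploration rate are assumptions, since the quoted text does not specify the environment API or the action-selection rule.

```python
# Minimal tabular Q-learning sketch matching the quoted hyperparameters
# (alpha = 0.1, gamma = 0.95, 100 episodes). Environment interface is an
# assumed Gym-style API, not the authors' setup.
import numpy as np
from collections import defaultdict


def q_learning(env, n_episodes=100, max_steps=100,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy action selection (exploration rate is assumed).
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Standard Q-learning update (Watkins & Dayan, 1992).
            target = reward + gamma * np.max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if terminated or truncated:
                break
    return q
```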