Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Temporal Abstraction in Reinforcement Learning with the Successor Representation
Authors: Marlos C. Machado, André Barreto, Doina Precup, Michael Bowling
JMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard. ... We perform numerical simulations to assess how effective options discovered by different methods are in capturing environment properties. |
| Researcher Affiliation | Collaboration | Marlos C. Machado (EMAIL): DeepMind; Alberta Machine Intelligence Institute (Amii); Department of Computing Science, University of Alberta, Edmonton, AB, Canada. André Barreto (EMAIL): DeepMind, London, United Kingdom. Doina Precup (EMAIL): DeepMind; Quebec AI Institute (Mila); School of Computer Science, McGill University, Montreal, QC, Canada. Michael Bowling (EMAIL): DeepMind; Alberta Machine Intelligence Institute (Amii); Department of Computing Science, University of Alberta, Edmonton, AB, Canada. |
| Pseudocode | Yes | Algorithm 1 depicts an implementation of the SR. ... Algorithm 2, on the next page, depicts the pseudo-code for CEO. ... See Algorithm 3 for a presentation of this discussion in pseudo-code. ... Algorithm 4: OK-Eigenoptions ... Algorithms 5 and 6 summarize eigenoption discovery. ... Algorithms 7 and 8, in Appendix C, summarize the presentation of covering options when computed both in closed form and online. |
| Open Source Code | No | The text does not explicitly state that the authors' implementation code is open-sourced or provide a link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | We use the four-room domain (Sutton et al., 1999), which we implemented with Gym-Minigrid (Chevalier-Boisvert et al., 2018). |
| Dataset Splits | No | The paper describes experiments conducted in simulated environments (e.g., 'four-room domain', 'open-room gridworld') where agents interact dynamically. It does not mention pre-collected datasets with explicit training, validation, or test splits, as data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions implementing the four-room domain with Gym-Minigrid, but it does not specify version numbers for this or any other software dependencies used in their experiments. |
| Experiment Setup | Yes | The Q-learning parameters we use are α = 0.1, γ = 0.9, and ε = 0.05. We use η = α_o = 0.1, γ_SR = γ_o = 0.99, and we sample options with 5% probability (p_option), which is similar to what we did in Section 6.4, where options were potentially sampled only in the exploration step of Q-learning with ε-greedy (ε = 0.05). We pass over D 100 times when learning the SR, and 1,000 times when learning the option policy, leveraging the off-policy aspect of our problem formulation. |
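The paper's Algorithm 1 is reported to depict a tabular implementation of the successor representation (SR). As a point of reference, the SR of a fixed policy can be learned with a TD(0) update of the form ψ(s) ← ψ(s) + η [1_s + γ_SR ψ(s') − ψ(s)]. The sketch below is not the authors' code; the chain environment, function name `learn_sr`, and the number of passes are illustrative assumptions, with η = 0.1 and γ_SR = 0.99 taken from the reported setup.

```python
import numpy as np

def learn_sr(transitions, n_states, eta=0.1, gamma_sr=0.99, passes=100):
    """Tabular successor representation learned by TD(0).

    transitions: list of (s, s') pairs collected under a fixed policy.
    Update: psi[s] += eta * (one_hot(s) + gamma_sr * psi[s'] - psi[s]).
    """
    psi = np.zeros((n_states, n_states))
    identity = np.eye(n_states)
    for _ in range(passes):
        for s, s_next in transitions:
            target = identity[s] + gamma_sr * psi[s_next]
            psi[s] += eta * (target - psi[s])
    return psi

# Deterministic 4-state chain: 0 -> 1 -> 2 -> 3, with 3 self-looping.
trans = [(0, 1), (1, 2), (2, 3), (3, 3)]
psi = learn_sr(trans, n_states=4)
```

After repeated passes over the transitions, each row ψ(s) approaches the expected discounted future occupancy of every state when starting from s, which is the quantity the paper's option-discovery methods (e.g., eigenoptions) build on.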
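For concreteness, the Q-learning configuration quoted above (α = 0.1, γ = 0.9, ε-greedy with ε = 0.05) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy chain environment, the episode count, and the random tie-breaking of greedy actions are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain: action 1 moves right, action 0 moves left.
# Reward 1 for reaching the rightmost (terminal) state.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == GOAL), s_next == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.05):
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:  # epsilon-greedy exploration
                a = int(rng.integers(N_ACTIONS))
            else:  # greedy action, ties broken at random
                a = int(rng.choice(np.flatnonzero(q[s] == q[s].max())))
            s_next, r, done = step(s, a)
            q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
            s = s_next
    return q

q = q_learning()
```

In the paper's option-based variant, the exploration step additionally samples an option (with probability p_option = 0.05) instead of a primitive action; the sketch above covers only the primitive-action baseline.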