Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Induction and Exploitation of Subgoal Automata for Reinforcement Learning

Authors: Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, Alessandra Russo

JAIR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ISA in several gridworld and continuous state space problems using different RL algorithms that leverage the automaton structures. We provide an in-depth empirical analysis of the automaton learning performance in terms of the traces, the symmetry breaking and specific restrictions imposed on the final learnable automaton.
Researcher Affiliation | Academia | Daniel Furelos-Blanco (EMAIL), Mark Law (EMAIL), Department of Computing, Imperial College London, London, SW7 2AZ, United Kingdom; Anders Jonsson (EMAIL), Department of Information and Communication Technologies, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain; Krysia Broda (EMAIL), Alessandra Russo (EMAIL), Department of Computing, Imperial College London, London, SW7 2AZ, United Kingdom
Pseudocode | Yes | Algorithm 1: ISA Algorithm
Open Source Code | Yes | The code is available at https://github.com/ertsiger/induction-subgoal-automata-rl.
Open Datasets | Yes | Our analysis focuses on evaluating how the behavior of the RL agent and the task being learned affect automaton learning and vice versa. Firstly, we describe the main characteristics of our evaluation methodology. Secondly, we make a thorough analysis of the performance of our approach using the Office World (Toro Icarte et al., 2018), Craft World (Andreas et al., 2017) and Water World (Toro Icarte et al., 2018) domains.
Dataset Splits | No | The sets of POMDPs used in these experiments are not handcrafted, but randomly generated (e.g., placing observables randomly in a grid).
Hardware Specification | Yes | All experiments ran on 3.40 GHz Intel Core i7-6700 processors.
Software Dependencies | No | We use ILASP2 to learn the automata with a 2 hour timeout for each automaton learning task. ... we use a Double DQN (DDQN, van Hasselt et al., 2016) to approximate the Q-functions in both HRL and QRM. ... We train the neural networks using the Adam optimizer (Kingma & Ba, 2015) with α = 1 × 10⁻⁵. No specific software versions for libraries or frameworks (e.g., Python, TensorFlow, PyTorch) are provided.
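For reference, the quoted learning rate α = 1 × 10⁻⁵ is the only Adam hyperparameter the excerpt gives. The sketch below shows a single Adam update (Kingma & Ba, 2015) in plain Python; β₁, β₂ and the numerical-stability constant are the standard defaults from the Adam paper, an assumption here, and this is not the authors' implementation.

```python
import math

ALPHA = 1e-5                          # learning rate quoted in the paper
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8  # standard Adam defaults (assumption)

def adam_step(theta, grad, m, v, t):
    """One Adam update over parallel lists of parameters and gradients.

    theta: parameters, grad: gradients, m/v: first/second moment
    estimates, t: 1-based step counter (for bias correction).
    Returns the updated (theta, m, v).
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = BETA1 * mi + (1 - BETA1) * g          # first moment
        vi = BETA2 * vi + (1 - BETA2) * g * g      # second moment
        m_hat = mi / (1 - BETA1 ** t)              # bias correction
        v_hat = vi / (1 - BETA2 ** t)
        new_theta.append(p - ALPHA * m_hat / (math.sqrt(v_hat) + EPS))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

At step t = 1 the bias-corrected update reduces to roughly a step of size α in the gradient's direction, which is why a tiny α such as 1e-5 yields very conservative per-step changes.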
Experiment Setup | Yes | Table 1: Parameters used in the Office World experiments: learning rate (α) = 0.1; exploration rate (ϵ) = 0.1; discount factor (γ) = 0.99; number of episodes = 10,000; maximum episode length (N) = 250; number of disjuncts (κ) = 1.
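To illustrate how the Table 1 hyperparameters are typically consumed, here is a minimal sketch of an ε-greedy tabular Q-learning update with α = 0.1, ϵ = 0.1 and γ = 0.99. The state/action interface is hypothetical and this is a generic update rule, not the authors' ISA implementation.

```python
import random

# Hyperparameters quoted from Table 1 (Office World experiments).
ALPHA = 0.1    # learning rate (α)
EPSILON = 0.1  # exploration rate (ϵ)
GAMMA = 0.99   # discount factor (γ)

def epsilon_greedy(q, state, actions, epsilon=EPSILON):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state, actions, done):
    """One tabular Q-learning update with the parameters above."""
    target = reward
    if not done:
        target += GAMMA * max(q.get((next_state, a), 0.0) for a in actions)
    key = (state, action)
    q[key] = q.get(key, 0.0) + ALPHA * (target - q.get(key, 0.0))
```

An episode would call `epsilon_greedy` to act and `q_update` after each transition, terminating after at most N = 250 steps and repeating for 10,000 episodes per the quoted setup.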