Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes

Authors: Guillermo Infante, Anders Jonsson, Vicenç Gómez

AAAI 2022, pp. 6970-6977

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze experimentally our proposed learning algorithm and show in two classical domains that it is more sample efficient than a flat learner and similar hierarchical approaches when the set of boundary states is smaller than the entire state space.
Researcher Affiliation | Academia | Dept. of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona (Spain). {guillermo.infante,anders.jonsson,vicen.gomez}@upf.edu
Pseudocode | Yes | Algorithm: Online and Intra-Task Learning Algorithm
Open Source Code | Yes | Code available at https://github.com/guillermoim/HRL_LMDP
Open Datasets | Yes | Rooms Domain. We analyze the performance for different room sizes and number of rooms (Figure 2). ... Taxi Domain. To allow comparison between all the methods, we adapted the Taxi domain as follows: when the taxi is at the correct pickup location, it can transition to a state with the passenger in the taxi. At a wrong pickup location, it can instead transition to a terminal state with a large negative reward (simulating an unsuccessful pickup). When the passenger is in the taxi, it can be dropped off at any pickup location, successfully completing the task whenever dropped at the correct destination. (A minimal sketch of these transition rules appears after the table.)
Dataset Splits | No | The paper describes online learning in simulated environments (Rooms and Taxi domains) and evaluates performance based on episodes and samples, but does not specify explicit train/validation/test dataset splits with percentages or counts.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instances) are provided for the experimental setup.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1').
Experiment Setup | Yes | In all experiments, the learning rate for each abstraction level ℓ is αℓ(t) = cℓ / (cℓ + n), where n is the episode to which sample t belongs. We empirically optimize the constant cℓ for each domain. For LMDPs, we use a temperature λ = 1, which provides good results. (A small sketch of this schedule appears after the table.)
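
To make the adapted Taxi dynamics quoted in the Open Datasets row concrete, the following is a minimal, hypothetical Python sketch of the pickup/drop-off rules. The state fields, reward magnitudes, and function names are illustrative assumptions and are not taken from the authors' implementation; in particular, the outcome of a drop-off at a wrong location is an assumption, as the quoted text only states that success occurs at the correct destination.

```python
# Hypothetical sketch of the adapted Taxi pickup/drop-off rules described above.
# Field names and the -10.0 penalty value are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TaxiState:
    taxi_loc: tuple          # (row, col) of the taxi
    passenger_in_taxi: bool  # True once a successful pickup has occurred
    pickup_loc: tuple        # correct pickup location
    dest_loc: tuple          # correct drop-off location
    terminal: bool = False

def apply_pickup_or_dropoff(s: TaxiState, action: str, locations):
    """Return (next_state, reward, done) for 'pickup' / 'dropoff' actions."""
    if action == "pickup" and not s.passenger_in_taxi:
        if s.taxi_loc == s.pickup_loc:
            # Correct pickup location: the passenger enters the taxi.
            return TaxiState(s.taxi_loc, True, s.pickup_loc, s.dest_loc), 0.0, False
        if s.taxi_loc in locations:
            # Wrong pickup location: terminal state with a large negative reward
            # (simulating an unsuccessful pickup).
            return TaxiState(s.taxi_loc, False, s.pickup_loc, s.dest_loc, True), -10.0, True
    if action == "dropoff" and s.passenger_in_taxi and s.taxi_loc in locations:
        # The passenger can be dropped off at any pickup location; the task
        # succeeds only at the correct destination (the penalty for a wrong
        # drop-off is an assumption of this sketch).
        success = s.taxi_loc == s.dest_loc
        next_state = TaxiState(s.taxi_loc, False, s.pickup_loc, s.dest_loc, True)
        return next_state, (0.0 if success else -10.0), True
    # Any other action leaves the passenger status unchanged.
    return s, 0.0, False
```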
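
The learning-rate schedule in the Experiment Setup row is straightforward to reproduce. The snippet below is a small sketch assuming the per-level constants cℓ are given; the constant values shown are placeholders, and the accompanying update is a generic Z-learning-style rule for an LMDP with temperature λ = 1, included only to show where the schedule and temperature enter. It is not the paper's hierarchical algorithm.

```python
import math

# Schedule from the Experiment Setup row: alpha_l(t) = c_l / (c_l + n),
# where n is the episode the sample t belongs to and c_l is a per-level
# constant tuned empirically per domain (values below are placeholders).
C_PER_LEVEL = {0: 500.0, 1: 1000.0}  # illustrative constants, not the paper's values

def learning_rate(level: int, episode: int) -> float:
    c = C_PER_LEVEL[level]
    return c / (c + episode)

# Generic Z-learning-style update on a desirability table z, with temperature
# lambda = 1 as in the paper's experiments; shown for illustration only.
LAMBDA = 1.0

def z_update(z: dict, s, s_next, reward: float, level: int, episode: int) -> None:
    alpha = learning_rate(level, episode)
    target = math.exp(reward / LAMBDA) * z.get(s_next, 1.0)
    z[s] = (1.0 - alpha) * z.get(s, 1.0) + alpha * target
```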