Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes
Authors: Guillermo Infante, Anders Jonsson, Vicenç Gómez
AAAI 2022, pp. 6970-6977
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze experimentally our proposed learning algorithm and show in two classical domains that it is more sample efficient compared to a flat learner and similar hierarchical approaches when the set of boundary states is smaller than the entire state space. |
| Researcher Affiliation | Academia | Dept. Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona (Spain) {guillermo.infante,anders.jonsson,vicen.gomez}@upf.edu |
| Pseudocode | Yes | Algorithm: Online and Intra-Task Learning Algorithm |
| Open Source Code | Yes | Code available at https://github.com/guillermoim/HRL_LMDP |
| Open Datasets | Yes | Rooms Domain. We analyze the performance for different room sizes and number of rooms (Figure 2). ... Taxi Domain. To allow comparison between all the methods, we adapted the Taxi domain as follows: when the taxi is at the correct pickup location, it can transition to a state with the passenger in the taxi. In a wrong pickup location, it can instead transition to a terminal state with large negative reward (simulating an unsuccessful pick-up). When the passenger is in the taxi, it can be dropped off at any pickup location, successfully completing the task whenever dropped at the correct destination. (A sketch of these adapted dynamics appears after the table.) |
| Dataset Splits | No | The paper describes online learning in simulated environments (Rooms and Taxi domains) and evaluates performance based on episodes and samples, but does not specify explicit train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instances) are provided for the experimental setup. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | In all experiments, the learning rate for each abstraction level ℓ is αℓ(t) = cℓ/(cℓ + n), where n is the episode to which each sample t belongs. We empirically optimize the constant cℓ for each domain. For LMDPs, we use a temperature λ = 1, which provides good results. (A sketch of this schedule appears after the table.) |
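
The adapted Taxi pickup/drop-off rules quoted above can be summarized in a short sketch. This is an illustrative reconstruction under assumptions, not the authors' implementation: the state fields, the set of pickup locations, the failure-reward magnitude, and all names are hypothetical, and the non-terminal handling of a drop-off at a wrong location is also an assumption.

```python
# Hypothetical sketch of the adapted Taxi dynamics described in the paper.
from dataclasses import dataclass, replace as dc_replace

PICKUP_LOCS = frozenset({0, 4, 20, 24})  # cell indices assumed for illustration
FAIL_REWARD = -100.0                     # "large negative reward"; value assumed

@dataclass(frozen=True)
class TaxiState:
    taxi_loc: int        # taxi's current cell
    has_passenger: bool  # whether the passenger is in the taxi
    pickup_loc: int      # where the passenger waits
    dest_loc: int        # correct drop-off location

def pickup(s: TaxiState):
    """Returns (next_state, reward, terminal). A pickup succeeds only at the
    passenger's location; elsewhere it ends the episode with a large negative
    reward (an unsuccessful pick-up)."""
    if s.has_passenger:
        return s, 0.0, False                          # no-op: already on board
    if s.taxi_loc == s.pickup_loc:
        return dc_replace(s, has_passenger=True), 0.0, False
    return s, FAIL_REWARD, True                       # terminal failure state

def dropoff(s: TaxiState):
    """Drop-off is allowed at any pickup location; the task terminates
    successfully only at the correct destination."""
    if not s.has_passenger or s.taxi_loc not in PICKUP_LOCS:
        return s, 0.0, False                          # no-op elsewhere
    next_s = dc_replace(s, has_passenger=False)
    return next_s, 0.0, next_s.taxi_loc == s.dest_loc # done iff correct cell
```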
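For the experiment setup, here is a minimal sketch of the reported learning-rate schedule αℓ(t) = cℓ/(cℓ + n). The function name and the example constant cℓ = 100 are illustrative assumptions; the paper tunes cℓ per domain empirically.

```python
def learning_rate(c_level: float, episode: int) -> float:
    """Decaying schedule α_ℓ(t) = c_ℓ / (c_ℓ + n), where n is the
    episode the current sample t belongs to."""
    return c_level / (c_level + episode)

# The rate starts at 1.0 (episode 0) and decays hyperbolically:
for n in (0, 100, 1000, 10000):
    print(n, round(learning_rate(100.0, n), 4))
# 0 1.0
# 100 0.5
# 1000 0.0909
# 10000 0.0099
```

A larger cℓ keeps the rate high for more episodes before the decay dominates, which is why the constant is worth optimizing per domain.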