Data-Efficient Hierarchical Reinforcement Learning

Authors: Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, Sergey Levine

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.
Researcher Affiliation | Collaboration | Ofir Nachum, Google Brain (ofirnachum@google.com); Shixiang Gu, Google Brain (shanegu@google.com), also at the University of Cambridge and the Max Planck Institute for Intelligent Systems; Honglak Lee, Google Brain (honglak@google.com); Sergey Levine, Google Brain (slevine@google.com), also at UC Berkeley.
Pseudocode | No | The paper describes the HIRO framework and training process in text and mathematical equations in Section 3, but does not provide structured pseudocode or an algorithm block. (A hedged sketch of the training loop, as we read Section 3, is given below the table.)
Open Source Code | Yes | Find open-source code at https://github.com/tensorflow/models/tree/master/research/efficient-hrl
Open Datasets | Yes | Ant Gather. The ant gather task is a standard task introduced in [9]. ... Ant Maze. For the first difficult navigation task we adapted the maze environment introduced in [9].
Dataset Splits | No | The paper mentions training for "10M steps" and evaluating the "performance of the best policy obtained", but does not specify fixed train/validation/test splits; such splits are not typical in reinforcement learning, where data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or specific cloud instances used for running experiments.
Software Dependencies | No | The paper mentions using the "TD3 learning algorithm [12]" and the "DDPG algorithm [25]" but does not provide specific software dependencies or versions, such as Python, TensorFlow, or PyTorch versions.
Experiment Setup | Yes | The higher-level policy observes the state and produces a high-level action (or goal) g_t ∈ R^{d_s} by either sampling from its policy g_t ∼ µ^hi when t ≡ 0 (mod c), or otherwise using a fixed goal transition function g_t = h(s_{t−1}, g_{t−1}, s_t)... In our implementation, we calculate the quantity on eight candidate goals sampled randomly from a Gaussian centered at s_{t+c} − s_t. We also include the original goal g_t and a goal corresponding to the difference s_{t+c} − s_t in the candidate set, to have a total of 10 candidates. (A sketch of this candidate-selection step is given below the table.)
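
Since the paper provides no algorithm block (see the Pseudocode row), the following is a minimal Python sketch of one higher-level decision interval as we read Section 3. The environment interface, the `mu_hi`/`mu_lo` callables, and the `rollout_interval` helper are our own placeholders, not the released TensorFlow code; the goal transition and intrinsic reward follow the formulas stated in the paper.

```python
import numpy as np

def goal_transition(s_prev, g_prev, s):
    # Fixed goal transition h(s_{t-1}, g_{t-1}, s_t) from Section 3: it keeps
    # the absolute target s_{t-1} + g_{t-1} fixed while re-expressing it
    # relative to the new state s_t.
    return s_prev + g_prev - s

def intrinsic_reward(s, g, s_next):
    # Lower-level reward r(s_t, g_t, a_t, s_{t+1}) = -||s_t + g_t - s_{t+1}||_2.
    return -np.linalg.norm(s + g - s_next)

def rollout_interval(env, mu_hi, mu_lo, s, c=10):
    # One higher-level decision interval: sample a goal, run the lower-level
    # policy for c steps, and collect the lower-level transitions.
    g = mu_hi(s)
    lo_transitions = []
    for _ in range(c):
        a = mu_lo(s, g)                      # lower-level action given the goal
        s_next = env.step(a)                 # placeholder env: returns next state
        lo_transitions.append((s, g, a, intrinsic_reward(s, g, s_next), s_next))
        g = goal_transition(s, g, s_next)    # roll the goal forward between samples
        s = s_next
    return s, lo_transitions

# Smoke test with random placeholders (2-D states, c = 3).
class DummyEnv:
    def step(self, a):
        return np.random.randn(2)

final_state, transitions = rollout_interval(
    DummyEnv(),
    lambda s: np.random.randn(2),            # stand-in for the higher-level policy
    lambda s, g: np.random.randn(2),         # stand-in for the lower-level policy
    np.zeros(2),
    c=3)
```

In the paper, both levels are trained off-policy with TD3 on the transitions collected this way; the sketch only shows how the goal is sampled, rolled forward, and used to compute the intrinsic reward.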
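
For the off-policy candidate-goal relabeling quoted in the Experiment Setup row, a hedged numpy sketch of the candidate-selection step follows. The function and parameter names (`relabel_goal`, `surrogate_logp`, `sigma`) are our own, and the released code may organize this differently; the surrogate log-probability uses the squared action error of a deterministic lower-level policy, as described in the paper.

```python
import numpy as np

def goal_transition(s_prev, g_prev, s):
    # Same fixed goal transition h as in the previous sketch.
    return s_prev + g_prev - s

def relabel_goal(states, actions, g_orig, mu_lo, sigma=1.0, n_random=8, seed=0):
    """Pick, out of 10 candidate goals, the one maximizing a surrogate
    log-probability of the lower-level actions that were actually taken.

    states : (c + 1, d_s) array holding s_t, ..., s_{t+c}
    actions: (c, d_a) array holding a_t, ..., a_{t+c-1}
    g_orig : goal originally emitted by the higher-level policy
    mu_lo  : deterministic lower-level policy, mu_lo(state, goal) -> action
    """
    rng = np.random.default_rng(seed)
    diff = states[-1] - states[0]                        # s_{t+c} - s_t
    # Eight candidates from a Gaussian centered at s_{t+c} - s_t, plus the
    # original goal and the difference itself: 10 candidates in total.
    random_cands = rng.normal(loc=diff, scale=sigma,
                              size=(n_random, diff.shape[0]))
    candidates = [np.asarray(g_orig), diff] + list(random_cands)

    def surrogate_logp(g0):
        # -1/2 * sum_i ||a_i - mu_lo(s_i, g_i)||^2, rolling g forward with h.
        g, total = g0, 0.0
        for s, s_next, a in zip(states[:-1], states[1:], actions):
            total -= 0.5 * np.sum((a - mu_lo(s, g)) ** 2)
            g = goal_transition(s, g, s_next)
        return total

    return max(candidates, key=surrogate_logp)
```

Per the paper, the goal selected this way replaces the original g_t in the stored higher-level transition before the higher-level critic is trained, which is the off-policy correction the authors credit for the method's sample efficiency.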