Data-Efficient Hierarchical Reinforcement Learning
Authors: Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, Sergey Levine
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques. |
| Researcher Affiliation | Collaboration | Ofir Nachum (Google Brain, ofirnachum@google.com); Shixiang Gu (Google Brain, shanegu@google.com; also at the University of Cambridge and the Max Planck Institute for Intelligent Systems); Honglak Lee (Google Brain, honglak@google.com); Sergey Levine (Google Brain, slevine@google.com; also at UC Berkeley). |
| Pseudocode | No | The paper describes the HIRO framework and training process in text and mathematical equations in Section 3, but does not provide structured pseudocode or an algorithm block (a hedged sketch of the control loop follows this table). |
| Open Source Code | Yes | Find open-source code at https://github.com/tensorflow/models/tree/master/research/efficient-hrl |
| Open Datasets | Yes | Ant Gather. The ant gather task is a standard task introduced in [9]. ... Ant Maze. For the first difficult navigation task we adapted the maze environment introduced in [9]. |
| Dataset Splits | No | The paper mentions training for "10M steps" and evaluating the "performance of the best policy obtained", but does not specify fixed train/validation/test splits; such splits are not typical in reinforcement learning, where data is generated dynamically through environment interaction. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or specific cloud instances used for running experiments. |
| Software Dependencies | No | The paper mentions using the "TD3 learning algorithm [12]" and "DDPG algorithm [25]" but does not provide specific software versions or dependencies like Python, TensorFlow, or PyTorch versions. |
| Experiment Setup | Yes | The higher-level policy observes the state and produces a high-level action (or goal) $g_t \in \mathbb{R}^{d_s}$ by either sampling from its policy $g_t \sim \mu^{hi}$ when $t \equiv 0 \pmod{c}$, or otherwise using a fixed goal transition function $g_t = h(s_{t-1}, g_{t-1}, s_t)$... In our implementation, we calculate the quantity on eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c} - s_t$. We also include the original goal $g_t$ and a goal corresponding to the difference $s_{t+c} - s_t$ in the candidate set, to have a total of 10 candidates. (See the relabeling sketch after this table.) |
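
Since the paper gives no algorithm block (see the Pseudocode row), the following is a minimal Python sketch of the two-level control loop described in Section 3. It assumes a classic Gym-style `env.step` interface, NumPy state and goal vectors, and deterministic `hi_policy` / `lo_policy` callables; all names are hypothetical, and the off-policy TD3 updates for both levels are omitted.

```python
import numpy as np

def intrinsic_reward(s, g, s_next):
    """Low-level reward: negative distance between the desired offset g and the
    achieved state change, i.e. -||s + g - s_next||_2 (Eq. 3 in the paper)."""
    return -np.linalg.norm(s + g - s_next)

def run_episode(env, hi_policy, lo_policy, c=10, max_steps=500):
    """One episode of the two-level control loop: the higher level emits a new
    goal every c steps; in between, the goal is rolled forward by the fixed
    transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t.

    hi_policy(state) -> goal (an offset in state space)
    lo_policy(state, goal) -> primitive action
    """
    s = env.reset()
    g = None
    transitions = []  # (s, g, a, r_lo, r_env, s_next); r_lo trains the low level,
                      # r_env would be summed over each c-step segment for the high level
    for t in range(max_steps):
        if t % c == 0:
            g = hi_policy(s)                    # sample a new high-level goal
        a = lo_policy(s, g)                     # low level acts toward the goal
        s_next, r_env, done, _ = env.step(a)
        r_lo = intrinsic_reward(s, g, s_next)
        transitions.append((s, g, a, r_lo, r_env, s_next))
        g = s + g - s_next                      # fixed goal transition h(...)
        s = s_next
        if done:
            break
    return transitions
```

The goal transition keeps the absolute target location fixed as the agent moves, which is why the goal is expressed as an offset in state space.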
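
The candidate-goal procedure quoted in the Experiment Setup row can be made concrete with a short NumPy sketch of the paper's off-policy goal relabeling. This is a hedged illustration, not the authors' code: `lo_policy`, `sigma`, and the squared-error score (a proxy for the low-level action log-likelihood maximized in the paper's correction) are assumptions of this sketch.

```python
import numpy as np

def goal_transition(s_prev, g_prev, s_curr):
    """Fixed goal transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t."""
    return s_prev + g_prev - s_curr

def relabel_goal(states, actions, orig_goal, lo_policy, sigma=1.0, n_sampled=8, rng=None):
    """Pick, among 10 candidate goals, the one that best explains the low-level
    actions actually taken over a c-step segment.

    states:  array of shape (c + 1, d_s)  -- s_t, ..., s_{t+c}
    actions: array of shape (c, d_a)      -- low-level actions taken
    lo_policy(state, goal) -> action      -- current deterministic low-level policy
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = states[-1] - states[0]  # s_{t+c} - s_t

    # Eight candidates from a Gaussian centered at s_{t+c} - s_t, plus the
    # original goal and the difference itself, for a total of 10 candidates.
    candidates = [np.asarray(orig_goal), delta]
    candidates += list(rng.normal(loc=delta, scale=sigma, size=(n_sampled, delta.shape[0])))

    def score(g):
        # Approximate log-likelihood of the observed actions when the low level
        # is conditioned on candidate goal g, proportional to
        # -0.5 * sum_i ||a_i - mu_lo(s_i, g_i)||^2, rolling g forward with h(...).
        total, g_i = 0.0, np.array(g)
        for s_prev, s_curr, a in zip(states[:-1], states[1:], actions):
            total += -0.5 * np.sum((a - lo_policy(s_prev, g_i)) ** 2)
            g_i = goal_transition(s_prev, g_i, s_curr)
        return total

    return max(candidates, key=score)
```

The returned goal would replace the originally stored high-level action in the replay buffer before computing the high-level policy update.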