Data-Efficient Hierarchical Reinforcement Learning
Authors: Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, Sergey Levine
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques. |
| Researcher Affiliation | Collaboration | Ofir Nachum (Google Brain, ofirnachum@google.com); Shixiang Gu (Google Brain, shanegu@google.com; also at the University of Cambridge and the Max Planck Institute for Intelligent Systems); Honglak Lee (Google Brain, honglak@google.com); Sergey Levine (Google Brain, slevine@google.com; also at UC Berkeley). |
| Pseudocode | No | The paper describes the HIRO framework and training process in text and mathematical equations in Section 3, but does not provide structured pseudocode or an algorithm block (a hedged sketch of the control loop follows this table). |
| Open Source Code | Yes | Find open-source code at https://github.com/tensorflow/models/tree/master/research/efficient-hrl |
| Open Datasets | Yes | Ant Gather. The ant gather task is a standard task introduced in [9]. ... Ant Maze. For the first difficult navigation task we adapted the maze environment introduced in [9]. |
| Dataset Splits | No | The paper mentions training for "10M steps" and evaluating the "performance of the best policy obtained", but does not specify fixed train/validation/test splits; such splits are not typical in reinforcement learning, where data is generated dynamically through environment interaction. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU models, or specific cloud instances used for running experiments. |
| Software Dependencies | No | The paper mentions using the "TD3 learning algorithm [12]" and "DDPG algorithm [25]" but does not provide specific software versions or dependencies like Python, TensorFlow, or PyTorch versions. |
| Experiment Setup | Yes | The higher-level policy observes the state and produces a high-level action (or goal) $g_t \in \mathbb{R}^{d_s}$ by either sampling from its policy $g_t \sim \mu^{hi}$ when $t \equiv 0 \pmod{c}$, or otherwise using a fixed goal transition function $g_t = h(s_{t-1}, g_{t-1}, s_t)$... In our implementation, we calculate the quantity on eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c} - s_t$. We also include the original goal $g_t$ and a goal corresponding to the difference $s_{t+c} - s_t$ in the candidate set, to have a total of 10 candidates. (See the relabeling sketch after this table.) |
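
Since the paper gives no algorithm block (see the Pseudocode row), the following is a minimal Python sketch of the two-level control loop described in Section 3. It assumes a classic Gym-style `env.step` interface, NumPy state and goal vectors, and deterministic `hi_policy` / `lo_policy` callables; all names are hypothetical, and the off-policy TD3 updates for both levels are omitted.

```python
import numpy as np

def intrinsic_reward(s, g, s_next):
    """Low-level reward: negative distance between the desired offset g and the
    achieved state change, i.e. -||s + g - s_next||_2 (Eq. 3 in the paper)."""
    return -np.linalg.norm(s + g - s_next)

def run_episode(env, hi_policy, lo_policy, c=10, max_steps=500):
    """One episode of the two-level control loop: the higher level emits a new
    goal every c steps; in between, the goal is rolled forward by the fixed
    transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t.

    hi_policy(state) -> goal (an offset in state space)
    lo_policy(state, goal) -> primitive action
    """
    s = env.reset()
    g = None
    transitions = []  # (s, g, a, r_lo, r_env, s_next); r_lo trains the low level,
                      # r_env would be summed over each c-step segment for the high level
    for t in range(max_steps):
        if t % c == 0:
            g = hi_policy(s)                    # sample a new high-level goal
        a = lo_policy(s, g)                     # low level acts toward the goal
        s_next, r_env, done, _ = env.step(a)
        r_lo = intrinsic_reward(s, g, s_next)
        transitions.append((s, g, a, r_lo, r_env, s_next))
        g = s + g - s_next                      # fixed goal transition h(...)
        s = s_next
        if done:
            break
    return transitions
```

The goal transition keeps the absolute target location fixed as the agent moves, which is why the goal is expressed as an offset in state space.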
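
The candidate-goal procedure quoted in the Experiment Setup row can be made concrete with a short NumPy sketch of the paper's off-policy goal relabeling. This is a hedged illustration, not the authors' code: `lo_policy`, `sigma`, and the squared-error score (a proxy for the low-level action log-likelihood maximized in the paper's correction) are assumptions of this sketch.

```python
import numpy as np

def goal_transition(s_prev, g_prev, s_curr):
    """Fixed goal transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t."""
    return s_prev + g_prev - s_curr

def relabel_goal(states, actions, orig_goal, lo_policy, sigma=1.0, n_sampled=8, rng=None):
    """Pick, among 10 candidate goals, the one that best explains the low-level
    actions actually taken over a c-step segment.

    states:  array of shape (c + 1, d_s)  -- s_t, ..., s_{t+c}
    actions: array of shape (c, d_a)      -- low-level actions taken
    lo_policy(state, goal) -> action      -- current deterministic low-level policy
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = states[-1] - states[0]  # s_{t+c} - s_t

    # Eight candidates from a Gaussian centered at s_{t+c} - s_t, plus the
    # original goal and the difference itself, for a total of 10 candidates.
    candidates = [np.asarray(orig_goal), delta]
    candidates += list(rng.normal(loc=delta, scale=sigma, size=(n_sampled, delta.shape[0])))

    def score(g):
        # Approximate log-likelihood of the observed actions when the low level
        # is conditioned on candidate goal g, proportional to
        # -0.5 * sum_i ||a_i - mu_lo(s_i, g_i)||^2, rolling g forward with h(...).
        total, g_i = 0.0, np.array(g)
        for s_prev, s_curr, a in zip(states[:-1], states[1:], actions):
            total += -0.5 * np.sum((a - lo_policy(s_prev, g_i)) ** 2)
            g_i = goal_transition(s_prev, g_i, s_curr)
        return total

    return max(candidates, key=score)
```

The returned goal would replace the originally stored high-level action in the replay buffer before computing the high-level policy update.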