Learning Multi-Level Hierarchies with Hindsight

Authors: Andrew Levy, George Konidaris, Robert Platt, Kate Saenko

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate experimentally in both grid world and simulated robotics domains that our approach can significantly accelerate learning relative to other non-hierarchical and hierarchical methods.
Researcher Affiliation | Academia | Andrew Levy, Department of Computer Science, Brown University, Providence, RI, USA (andrew_levy2@brown.edu); George Konidaris, Department of Computer Science, Brown University, Providence, RI, USA (gdk@cs.brown.edu); Robert Platt, College of Computer and Information Science, Northeastern University, Boston, MA, USA (rplatt@ccs.neu.edu); Kate Saenko, Department of Computer Science, Boston University, Boston, MA, USA (saenko@bu.edu)
Pseudocode | Yes | Algorithm 1 Hierarchical Actor-Critic (HAC) ... Algorithm 2 Hierarchical Q-Learning (HierQ). (A hedged sketch of the nested rollout that Algorithm 1 describes appears after the table.)
Open Source Code | Yes | For further detail, see the GitHub repository available at https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-.
Open Datasets | Yes | We evaluated our approach on both grid world tasks and more complex simulated robotics environments. The continuous tasks consisted of the following simulated robotics environments developed in MuJoCo (Todorov et al., 2012): (i) inverted pendulum, (ii) UR5 reacher, (iii) ant reacher, and (iv) ant four rooms.
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits. It mentions using DDPG and HER, which often involve such splits, but does not specify the percentages or counts for the datasets or environments used.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software like 'Q-learning', 'DDPG', 'HER', and 'MuJoCo' but does not provide specific version numbers for these or other software libraries/dependencies.
Experiment Setup | Yes | DDPG Parameters: Bounded Q-Values: We bound the output of each critic function to the range [-H, 0] using a negative sigmoid function... DDPG Target Networks: ...we removed the target networks used in DDPG... Exploration: Each level uses the following exploration strategy... 20% of actions are sampled uniformly at random... 80% of actions are the sum of actions sampled from the level's policy and Gaussian noise. Neural Network Architectures: All actor and critic neural networks had 3 hidden layers, with 64 nodes in each hidden layer. ReLU activation functions were used. HAC Parameters: Maximum horizon of a subgoal, H: (1) for k=3-level agents in MuJoCo tasks, H = 10; (2) for k=2-level agents in MuJoCo tasks, H was generally in the range [20, 30]. Subgoal testing rate λ = 0.3. Goal and subgoal achievement thresholds were hand-crafted. (A hedged configuration sketch based on these values appears after the table.)
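
The Pseudocode row names Algorithm 1 (HAC) and Algorithm 2 (HierQ) without reproducing them. Below is a minimal, hedged sketch of the nested rollout structure that Algorithm 1 describes, under stated assumptions: the one-dimensional toy environment, the stand-in policy, and the noise scale are illustrative only, and the replay buffers, hindsight action/goal transitions, subgoal penalties, and actor-critic updates of the real algorithm are reduced to comments.

```python
import random

H = 10        # max attempts a level grants to the level below (paper: H = 10 for 3-level MuJoCo agents)
LAMBDA = 0.3  # subgoal testing rate reported in the paper


class Toy1DEnv:
    """Stand-in environment: an agent walking along a line (not from the paper)."""

    def __init__(self):
        self.position = 0.0

    def step(self, action):
        self.position += max(-1.0, min(1.0, action))  # bounded primitive step
        return self.position

    def reached(self, state, goal, threshold=0.5):
        return abs(state - goal) <= threshold


def propose(level, state, goal, deterministic):
    """Stand-in goal-conditioned policy for every level: levels above the
    bottom output a subgoal *state*, the bottom level outputs a primitive
    action. Gaussian noise is added unless the level is being subgoal-tested."""
    if level > 0:
        out = state + max(-3.0, min(3.0, goal - state))  # subgoal state
    else:
        out = max(-1.0, min(1.0, goal - state))          # primitive action
    if not deterministic:
        out += random.gauss(0.0, 0.3)
    return out


def rollout(level, state, goal, env, test=False):
    """One segment at `level`: the level gets at most H attempts to reach
    `goal` before control returns to the level above."""
    for _ in range(H):
        # When `test` is True this level acts without exploration noise
        # because a level above is testing its subgoal.
        action = propose(level, state, goal, deterministic=test)

        if level > 0:
            # With probability LAMBDA, test the proposed subgoal: the levels
            # below act deterministically, and in HAC a missed tested subgoal
            # is penalized with -H (the penalty and replay bookkeeping are
            # omitted in this sketch).
            test_below = test or (random.random() < LAMBDA)
            state = rollout(level - 1, state, action, env, test_below)  # action = subgoal state
        else:
            state = env.step(action)  # action = primitive action

        # Hindsight action and hindsight goal transitions would be stored
        # here for the off-policy learner (DDPG in the continuous tasks).
        if env.reached(state, goal):
            break
    return state


if __name__ == "__main__":
    final_state = rollout(level=2, state=0.0, goal=8.0, env=Toy1DEnv())
    print("final position:", final_state)
```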
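
The Experiment Setup row fixes the network shape (3 hidden layers of 64 ReLU units), a critic bounded to [-H, 0] with a negative sigmoid, and the 20%/80% exploration split. The sketch below instantiates those values; the choice of PyTorch, the tanh scaling of the actor output, and the noise_std default are assumptions made for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64, depth=3):
    """3 hidden layers with 64 units each and ReLU activations, per the paper."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class BoundedCritic(nn.Module):
    """Critic whose output is squashed into (-H, 0) with a negative sigmoid,
    matching the bounded Q-values described in the setup."""

    def __init__(self, state_dim, goal_dim, action_dim, horizon):
        super().__init__()
        self.horizon = float(horizon)
        self.net = mlp(state_dim + goal_dim + action_dim, 1)

    def forward(self, state, goal, action):
        raw = self.net(torch.cat([state, goal, action], dim=-1))
        return -self.horizon * torch.sigmoid(raw)


class Actor(nn.Module):
    """Goal-conditioned actor; the tanh scaling to an action bound is an
    assumption of this sketch."""

    def __init__(self, state_dim, goal_dim, action_dim, action_bound):
        super().__init__()
        self.bound = action_bound
        self.net = mlp(state_dim + goal_dim, action_dim)

    def forward(self, state, goal):
        return self.bound * torch.tanh(self.net(torch.cat([state, goal], dim=-1)))


def explore(actor, state, goal, action_bound, noise_std=0.1):
    """Exploration rule from the setup: 20% of actions sampled uniformly at
    random, 80% the policy's action plus Gaussian noise (noise_std is a
    placeholder, not a value from the paper)."""
    action = actor(state, goal)
    if torch.rand(()) < 0.2:
        return (2 * torch.rand_like(action) - 1) * action_bound
    return (action + noise_std * torch.randn_like(action)).clamp(-action_bound, action_bound)
```

Per the row above, the horizon argument would be H = 10 for 3-level MuJoCo agents and roughly 20-30 for 2-level agents; the subgoal testing rate λ = 0.3 and the removal of DDPG target networks belong to the training loop rather than to these modules.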