Hierarchical Reinforcement Learning by Discovering Intrinsic Options

Authors: Jesse Zhang, Haonan Yu, Wei Xu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate success rate and sample efficiency across two environment suites, as shown in Figure 2. Important details are presented here with more information in Appendix Section B.
Researcher Affiliation | Collaboration | Jesse Zhang (1), Haonan Yu (2), Wei Xu (2); (1) University of Southern California, (2) Horizon Robotics
Pseudocode | Yes | The appendix section "Pseudo Code for HIDIO" presents Algorithm 1: Hierarchical RL with Intrinsic Options Discovery (a structural sketch follows the table).
Open Source Code | Yes | Code available at https://www.github.com/jesbu1/hidio.
Open Datasets | Yes | The first suite consists of two 7-DOF reaching and pushing environments evaluated in Chua et al. (2018). ... We also propose another suite of environments called SOCIALROBOT. We construct two sparse reward robotic navigation and manipulation tasks, GOALTASK and KICKBALL. ... Code available at https://www.github.com/HorizonRobotics/SocialRobot
Dataset Splits | No | The paper describes training parameters and evaluation intervals for continuous interaction with environments, but does not specify explicit training/validation/test splits (percentages or counts from a pre-defined static dataset), as is typical for supervised learning.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper states "We implement HIDIO based on an RL framework called ALF", but does not provide version numbers for ALF or for other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | Number of parallel actors/environments per rollout: 20; steps per episode: 100; batch size: 2048; learning rate: 1e-4 for all network modules; policy/Q network hidden layers: (256, 256, 256) with ReLU non-linearities; Polyak averaging coefficient for target Q: 0.999; training batches per iteration: 100.
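
For quick reference, the values reported in the Experiment Setup row can be collected into a plain Python dictionary. This is only a convenience sketch; the key names are our own shorthand and do not correspond to ALF's actual configuration keys.

# Hyperparameters as reported in the Experiment Setup row above.
# Key names are our own shorthand, not ALF configuration keys.
HIDIO_REPORTED_HYPERPARAMS = {
    "num_parallel_actors": 20,                   # parallel actors/environments per rollout
    "steps_per_episode": 100,
    "batch_size": 2048,
    "learning_rate": 1e-4,                       # shared by all network modules
    "policy_q_hidden_layers": (256, 256, 256),   # with ReLU non-linearities
    "target_q_polyak": 0.999,                    # Polyak averaging coefficient for target Q
    "training_batches_per_iteration": 100,
}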
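
The Pseudocode row above cites Algorithm 1 (Hierarchical RL with Intrinsic Options Discovery). The Python sketch below is a minimal, hedged reading of the hierarchy that algorithm describes: a higher-level scheduler emits a latent option every K steps, a lower-level worker acts conditioned on the current option, and a discriminator that tries to recover the option from the worker's transitions supplies the worker's intrinsic reward. All class and function names here (Scheduler, Worker, Discriminator, rollout, DummyEnv) are hypothetical placeholders rather than the paper's or ALF's API, and the learning updates themselves are omitted.

# Structural sketch of the hierarchy behind Algorithm 1, under the assumptions
# stated above. Names are hypothetical placeholders; no learning is performed.
import numpy as np

OPTION_DIM = 8
ACTION_DIM = 2
K = 3  # the scheduler emits a new option every K environment steps

class Scheduler:
    """Higher-level policy: maps the current state to a latent option u."""
    def sample_option(self, state):
        return np.random.uniform(-1.0, 1.0, size=OPTION_DIM)  # placeholder

class Worker:
    """Lower-level policy: maps (state, option) to a primitive action."""
    def act(self, state, option):
        return np.random.uniform(-1.0, 1.0, size=ACTION_DIM)  # placeholder

class Discriminator:
    """Scores how well the option can be recovered from a worker transition;
    this score is used as the worker's intrinsic reward."""
    def log_prob(self, option, state, action, next_state):
        return float(-np.sum(option ** 2))  # placeholder log-likelihood

def rollout(env, scheduler, worker, discriminator, steps_per_episode=100):
    """Collect one episode, accumulating environment reward for the scheduler
    and intrinsic reward for the worker."""
    state = env.reset()
    option = None
    scheduler_return, worker_intrinsic_return = 0.0, 0.0
    for t in range(steps_per_episode):
        if t % K == 0:  # scheduler decision point
            option = scheduler.sample_option(state)
        action = worker.act(state, option)
        next_state, reward, done = env.step(action)
        scheduler_return += reward  # scheduler is driven by the task reward
        worker_intrinsic_return += discriminator.log_prob(
            option, state, action, next_state)  # worker is driven by intrinsic reward
        state = next_state
        if done:
            break
    return scheduler_return, worker_intrinsic_return

class DummyEnv:
    """Stand-in environment so the sketch runs end to end."""
    def reset(self):
        return np.zeros(4)
    def step(self, action):
        return np.zeros(4), 0.0, False

if __name__ == "__main__":
    print(rollout(DummyEnv(), Scheduler(), Worker(), Discriminator()))

The point the sketch is meant to surface, under the reading above, is the split of reward signals: the scheduler is trained on the environment's task reward, while the worker only ever sees the discriminator's score for its current option.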