Hierarchical Reinforcement Learning by Discovering Intrinsic Options
Authors: Jesse Zhang, Haonan Yu, Wei Xu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate success rate and sample efficiency across two environment suites, as shown in Figure 2. Important details are presented here with more information in appendix Section B. |
| Researcher Affiliation | Collaboration | Jesse Zhang¹, Haonan Yu², Wei Xu²; ¹University of Southern California, ²Horizon Robotics |
| Pseudocode | Yes | Appendix A ("Pseudo-code for HIDIO"), Algorithm 1: Hierarchical RL with Intrinsic Options Discovery. A hedged sketch of this loop is given after the table. |
| Open Source Code | Yes | Code available at https://www.github.com/jesbu1/hidio. |
| Open Datasets | Yes | The first suite consists of two 7-DOF reaching and pushing environments evaluated in Chua et al. (2018). ... We also propose another suite of environments called SOCIALROBOT. We construct two sparse reward robotic navigation and manipulation tasks, GOALTASK and KICKBALL. ... Code available at https://www.github.com/HorizonRobotics/SocialRobot |
| Dataset Splits | No | The paper describes training parameters and evaluation intervals for continuous interaction with environments, but does not specify explicit training/validation/test dataset splits (as percentages or counts from a pre-defined static dataset), as would be typical for supervised learning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper states "We implement HIDIO based on an RL framework called ALF", but does not provide specific version numbers for ALF or any other software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | Number of parallel actors/environments per rollout: 20... Steps per episode: 100... Batch size: 2048... Learning rate: 1e-4 for all network modules... Policy/Q network hidden layers: (256, 256, 256) with ReLU non-linearities... Polyak averaging coefficient for target Q: 0.999... Training batches per iteration: 100. These values are collected into the config sketch below. |
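
The paper's Algorithm 1 is not reproduced in this report, but a minimal sketch of the hierarchical loop it describes might look like the following. All names (`scheduler`, `worker`, `discriminator`, `replay_buffer`) and the specific values of `K` and `OPTION_DIM` are illustrative assumptions, not the authors' code; the structure follows the paper's description of a higher-level scheduler that emits a latent option every K steps, a lower-level worker that acts conditioned on that option, and an intrinsic reward derived from how well the option can be inferred from the worker's behavior.

```python
# Hypothetical sketch of the hierarchical training loop described in Algorithm 1 of the paper.
# Object names, method signatures, and constants below are illustrative assumptions.

K = 3            # scheduler picks a new option every K environment steps (assumed value)
OPTION_DIM = 8   # dimensionality of the latent option u (assumed value)


def rollout_episode(env, scheduler, worker, discriminator, replay_buffer,
                    steps_per_episode=100):
    """Collect one episode with the two-level policy and store transitions."""
    obs = env.reset()
    option = None
    for t in range(steps_per_episode):
        if t % K == 0:
            # Higher-level scheduler: samples a latent option from the current observation.
            option = scheduler.sample(obs)
        # Lower-level worker: acts conditioned on both observation and current option.
        action = worker.sample(obs, option)
        next_obs, task_reward, done, _ = env.step(action)
        # Intrinsic reward for the worker: log-likelihood of recovering the option
        # from the worker's behavior, estimated by a learned discriminator q(u | s, a).
        intrinsic_reward = discriminator.log_prob(option, obs, action)
        replay_buffer.add(obs, option, action, task_reward, intrinsic_reward,
                          next_obs, done)
        obs = next_obs
        if done:
            break


def train_iteration(scheduler, worker, discriminator, replay_buffer,
                    batches_per_iter=100, batch_size=2048):
    """One training iteration: off-policy (SAC-style) updates for all modules."""
    for _ in range(batches_per_iter):
        batch = replay_buffer.sample(batch_size)
        discriminator.update(batch)   # maximize log q(u | s, a) on sampled options
        worker.update(batch)          # maximize entropy-regularized intrinsic reward
        scheduler.update(batch)       # maximize the extrinsic task reward
```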
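
For convenience, the experiment-setup details quoted above can be gathered into a single configuration sketch. The values are the ones reported in the paper; the dictionary keys themselves are illustrative and do not come from the authors' code.

```python
# Experiment configuration reported in the paper (values as quoted above);
# key names are illustrative assumptions, not taken from the authors' code.
hidio_config = {
    "num_parallel_envs": 20,            # parallel actors/environments per rollout
    "steps_per_episode": 100,
    "batch_size": 2048,
    "learning_rate": 1e-4,              # shared by all network modules
    "hidden_layers": (256, 256, 256),   # policy/Q networks, ReLU non-linearities
    "target_q_polyak": 0.999,           # Polyak averaging coefficient for target Q
    "train_batches_per_iteration": 100,
}
```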