Stochastic Neural Networks for Hierarchical Reinforcement Learning
Authors: Carlos Florensa, Yan Duan, Pieter Abbeel
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks. |
| Researcher Affiliation | Collaboration | Carlos Florensa, Yan Duan, Pieter Abbeel; UC Berkeley, Department of Electrical Engineering and Computer Science; International Computer Science Institute; OpenAI. florensa@berkeley.edu, {rocky,pieter}@openai.com |
| Pseudocode | Yes | Algorithm 1: Skill training for SNNs with MI bonus (a hedged sketch of this bonus appears after this table). |
| Open Source Code | Yes | Code available at: https://github.com/florensacc/snn4hrl |
| Open Datasets | Yes | We have applied our framework to the two hierarchical tasks described in the benchmark by Duan et al. (2016): Locomotion + Maze and Locomotion + Food Collection (Gather). |
| Dataset Splits | No | The paper describes pre-training and downstream tasks, and uses terms like 'batch size' and 'maximum path length' related to online data collection in reinforcement learning, but does not provide explicit training, validation, and test dataset splits in the traditional sense of fixed data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions 'TRPO' as the policy optimization algorithm but does not provide specific software dependencies like library names with version numbers (e.g., 'Python 3.x', 'PyTorch x.x') that are needed to replicate the experiment. |
| Experiment Setup | Yes | All policies are trained with TRPO with step size 0.01 and discount 0.99. All neural networks (each of the Multi-policy ones, the SNN, and the Manager Network) have 2 layers of 32 hidden units. For the SNN training, the mesh density used to grid the (x, y) space and give the MI bonus is 10 divisions/unit. The number of skills trained (i.e., the dimension of the latent variable in the SNN or the number of independently trained policies in the Multi-policy setup) is 6. The batch size and the maximum path length for the pre-train task are the ones used in the benchmark (Duan et al., 2016): 50,000 and 500 respectively. For the downstream tasks, see Tab. 1. (These values are collected in the config sketch after this table.) |
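
The pseudocode row above points to Algorithm 1, skill training for SNNs with a mutual-information bonus over a discretized (x, y) grid. Below is a minimal sketch of how such a mesh-based bonus could be maintained, assuming the plane is gridded at the reported mesh density and that the bonus is proportional to the log empirical probability of the latent code given the visited cell. The class name, the coefficient, and the count-based estimator are illustrative assumptions, not the authors' implementation (see the released snn4hrl code for the actual one).

```python
# Sketch of a mesh-based MI bonus for SNN skill pre-training.
# Assumption: bonus ~ log p_hat(latent | cell), with p_hat estimated from
# visitation counts accumulated over the current batch of rollouts.
from collections import defaultdict
import math


class MeshMIBonus:
    def __init__(self, mesh_density=10, bonus_coeff=0.001):
        self.mesh_density = mesh_density  # 10 divisions/unit (paper value)
        self.bonus_coeff = bonus_coeff    # hypothetical scaling coefficient
        self.cell_latent_counts = defaultdict(lambda: defaultdict(int))
        self.cell_counts = defaultdict(int)

    def _cell(self, x, y):
        # Discretize the agent's (x, y) center-of-mass position into a grid cell.
        return (int(math.floor(x * self.mesh_density)),
                int(math.floor(y * self.mesh_density)))

    def update(self, x, y, latent):
        # Accumulate visitation counts for the visited cell under the active latent.
        c = self._cell(x, y)
        self.cell_latent_counts[c][latent] += 1
        self.cell_counts[c] += 1

    def bonus(self, x, y, latent):
        # Reward states whose cell is predictive of the active latent code.
        c = self._cell(x, y)
        if self.cell_counts[c] == 0:
            return 0.0
        p_latent_given_cell = self.cell_latent_counts[c][latent] / self.cell_counts[c]
        if p_latent_given_cell == 0.0:
            return 0.0
        return self.bonus_coeff * math.log(p_latent_given_cell)
```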
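
For convenience, the hyperparameters quoted in the experiment-setup row can be restated as a plain config. The field names are ours; the values are copied from the paper's description, and the downstream-task batch sizes and horizons remain in the paper's Tab. 1.

```python
# Reported pre-training hyperparameters, restated as an illustrative config dict.
PRETRAIN_CONFIG = {
    "algo": "TRPO",
    "step_size": 0.01,          # TRPO step size
    "discount": 0.99,           # reward discount factor
    "hidden_sizes": (32, 32),   # 2 layers of 32 units for every network
    "num_skills": 6,            # latent dimension / number of independent policies
    "mesh_density": 10,         # divisions per unit for the (x, y) MI-bonus grid
    "batch_size": 50_000,       # pre-train batch size (Duan et al., 2016 benchmark)
    "max_path_length": 500,     # pre-train horizon (Duan et al., 2016 benchmark)
}
```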