Stochastic Neural Networks for Hierarchical Reinforcement Learning

Authors: Carlos Florensa, Yan Duan, Pieter Abbeel

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.
Researcher Affiliation | Collaboration | Carlos Florensa, Yan Duan, Pieter Abbeel (UC Berkeley, Department of Electrical Engineering and Computer Science; International Computer Science Institute; OpenAI). florensa@berkeley.edu, {rocky,pieter}@openai.com
Pseudocode | Yes | Algorithm 1: Skill training for SNNs with MI bonus (a hedged sketch of the MI bonus follows this table).
Open Source Code | Yes | Code available at: https://github.com/florensacc/snn4hrl
Open Datasets | Yes | We have applied our framework to the two hierarchical tasks described in the benchmark by Duan et al. (2016): Locomotion + Maze and Locomotion + Food Collection (Gather).
Dataset Splits | No | The paper describes pre-training and downstream tasks, and uses terms such as 'batch size' and 'maximum path length' that refer to online data collection in reinforcement learning, but it does not provide explicit training, validation, and test splits in the traditional sense of a fixed data partition (see the data-collection sketch after this table).
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used to run the experiments.
Software Dependencies | No | The paper mentions TRPO as the policy optimization algorithm but does not list software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch x.x') needed to replicate the experiments.
Experiment Setup | Yes | All policies are trained with TRPO with step size 0.01 and discount 0.99. All neural networks (each of the Multi-policy ones, the SNN, and the Manager Network) have 2 layers of 32 hidden units. For the SNN training, the mesh density used to grid the (x, y) space and give the MI bonus is 10 divisions per unit. The number of skills trained (i.e., the dimension of the latent variable in the SNN, or the number of independently trained policies in the Multi-policy setup) is 6. The batch size and the maximum path length for the pre-train task are also the ones used in the benchmark (Duan et al., 2016): 50,000 and 500 respectively. For the downstream tasks, see Tab. 1. (These hyperparameters are gathered in the configuration sketch after this table.)
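The Pseudocode row refers to Algorithm 1, skill training for SNNs with an MI bonus. As a rough illustration of how a mutual-information bonus over the (x, y) visitation grid could be computed per batch, here is a minimal Python sketch. It assumes the per-step bonus is the empirical log-probability of the rollout's latent code given the visited grid cell; the function name, the path-dictionary keys, and the coefficient `alpha_h` are illustrative and not taken from the paper or the released code, while the default cell size of 0.1 reflects the reported mesh density of 10 divisions per unit.

```python
import numpy as np
from collections import defaultdict


def mi_bonus_rewards(paths, num_skills, cell_size=0.1, alpha_h=0.1):
    """Add a mutual-information bonus to the rewards of one batch (sketch only).

    Each path is assumed to be a dict with:
      'com_xy'  : (T, 2) array of (x, y) center-of-mass positions,
      'rewards' : length-T array of environment rewards,
      'latent'  : integer index of the skill active for the whole rollout.
    The bonus at step t is log p_hat(latent | cell_t), with p_hat estimated
    from visitation counts over the current batch.
    """
    # Count how often each latent visits each grid cell in this batch.
    counts = defaultdict(lambda: np.zeros(num_skills))
    for path in paths:
        cells = [tuple(np.floor(np.asarray(xy) / cell_size).astype(int))
                 for xy in path['com_xy']]
        path['_cells'] = cells
        for c in cells:
            counts[c][path['latent']] += 1.0

    # Empirical log-probability of the active latent given the visited cell.
    for path in paths:
        z = path['latent']
        bonus = np.array([np.log(counts[c][z] / counts[c].sum())
                          for c in path['_cells']])
        path['rewards'] = np.asarray(path['rewards']) + alpha_h * bonus
    return paths
```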
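On the Dataset Splits row: on-policy RL replaces fixed train/validation/test partitions with fresh batches of trajectories gathered by the current policy at every iteration. A minimal sketch of that collection loop, assuming a Gym-style environment and a policy exposing a `get_action` method (both names are illustrative), using the reported pre-train batch size of 50,000 samples and maximum path length of 500:

```python
def collect_batch(env, policy, batch_size=50_000, max_path_length=500):
    """Gather one on-policy batch of trajectories with the current policy.

    Illustrative helper: assumes env.reset()/env.step() in the Gym style and
    policy.get_action(obs) -> (action, agent_info).
    """
    paths, n_samples = [], 0
    while n_samples < batch_size:
        obs = env.reset()
        path = {'observations': [], 'actions': [], 'rewards': []}
        for _ in range(max_path_length):
            action, _ = policy.get_action(obs)
            next_obs, reward, done, _ = env.step(action)
            path['observations'].append(obs)
            path['actions'].append(action)
            path['rewards'].append(reward)
            obs = next_obs
            if done:
                break
        paths.append(path)
        n_samples += len(path['rewards'])
    return paths
```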
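For reference, the hyperparameters quoted in the Experiment Setup row can be collected in a single configuration. The values below come from that row; the key names are illustrative and are not tied to the released code.

```python
# Pre-training hyperparameters reported in the paper (Experiment Setup row).
PRETRAIN_CONFIG = {
    'algo': 'TRPO',
    'step_size': 0.01,         # TRPO step size
    'discount': 0.99,          # discount factor
    'hidden_sizes': (32, 32),  # 2 layers of 32 hidden units for every network
    'num_skills': 6,           # SNN latent dimension / number of Multi-policy policies
    'mesh_density': 10,        # (x, y) grid divisions per unit for the MI bonus
    'batch_size': 50_000,      # samples per iteration on the pre-train task
    'max_path_length': 500,    # horizon on the pre-train task
}
```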