META LEARNING SHARED HIERARCHIES

Authors: Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, John Schulman

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach on a wide range of environments, including 2D continuous movement, gridworld navigation, and 3D physics tasks involving the directional movement of robots. In the 3D environments, we enable humanoid robots to both walk and crawl with the same policy, and 4-legged robots to discover directional movement primitives to solve a distribution of mazes as well as sparse-reward obstacle courses. Our experiments show that our method is capable of learning meaningful sub-policies solely through interaction with a distribution of tasks, outperforming previously proposed algorithms.
Researcher Affiliation | Collaboration | Kevin Frans (Henry M. Gunn High School; work done as an intern at OpenAI; kevinfrans2@gmail.com); Jonathan Ho, Xi Chen, and Pieter Abbeel (UC Berkeley, Department of Electrical Engineering and Computer Science); John Schulman (OpenAI)
Pseudocode | Yes | Algorithm 1: Meta Learning Shared Hierarchies
Open Source Code | No | The paper provides a link to supplemental videos ('Videos at https://sites.google.com/site/mlshsupplementals') but does not state that source code is provided or link to a code repository.
Open Datasets | No | The paper describes various environments/tasks used for experiments ('2D moving bandits task', 'four-rooms domain', 'Ant Twowalk', 'Walk/Crawl task', 'Ant Obstacle course task'), which appear to be custom-built or based on existing domains, without explicit public-dataset access information (links, DOIs, or formal citations for public datasets).
Dataset Splits | No | The paper describes a 'warmup period' and a 'joint update period' for optimizing policies, but these refer to training phases rather than formal training/validation/test splits of a dataset with explicit percentages or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions 'PPO (Schulman et al., 2017)' as the policy gradient method and 'Mujoco (Todorov et al., 2012)' for simulation, but it does not specify version numbers for these or any other software components.
Experiment Setup | Yes | For both master and sub-policies, we use 2-layer MLPs with a hidden size of 64. Master policy actions are sampled through a softmax distribution. We train both master and sub-policies using policy gradient methods, specifically PPO (Schulman et al., 2017). For collecting experience, we compute a batchsize of D=2000 timesteps. We use a much larger learning rate for θ (0.01) than for φ (0.0003), since φ parameters should remain relatively consistent throughout a single warmup and joint-update period. Warmup and joint-update lengths for individual environment distributions will be described in the following section.
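
The Experiment Setup row fixes most of the reported hyperparameters (2-layer MLPs with hidden size 64, a softmax master policy over sub-policies, PPO, batches of D=2000 timesteps, learning rates 0.01 for θ and 0.0003 for φ). Since no source code is linked, the following is a minimal PyTorch-style sketch of how those quoted settings could be wired up. Everything beyond the quoted numbers is an assumption for illustration: the class and function names, the use of Adam, discrete actions (the paper's 3D tasks use continuous control), and the example dimensions are not from the paper.

```python
# Sketch of the MLSH setup quoted above: 2-layer MLPs (hidden 64), softmax master
# policy theta over K sub-policies phi_1..phi_K, D=2000-timestep batches, and
# separate learning rates for theta (0.01) and phi (0.0003).
# Names, Adam, discrete actions, and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical


def mlp(in_dim, out_dim, hidden=64):
    """2-layer MLP with hidden size 64, as stated in the Experiment Setup row."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )


class MasterPolicy(nn.Module):
    """Master policy (parameters theta): softmax over the K sub-policy indices."""
    def __init__(self, obs_dim, num_subpolicies):
        super().__init__()
        self.logits = mlp(obs_dim, num_subpolicies)

    def forward(self, obs):
        return Categorical(logits=self.logits(obs))


class SubPolicy(nn.Module):
    """Sub-policy (parameters phi_k); discrete actions here purely for brevity."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.logits = mlp(obs_dim, act_dim)

    def forward(self, obs):
        return Categorical(logits=self.logits(obs))


# Hyperparameters quoted in the table; warmup/joint-update lengths vary per environment.
D = 2000          # timesteps collected per batch
LR_THETA = 1e-2   # master policy learning rate
LR_PHI = 3e-4     # sub-policy learning rate

obs_dim, act_dim, K = 10, 4, 3   # example sizes, not taken from the paper
master = MasterPolicy(obs_dim, K)
subs = [SubPolicy(obs_dim, act_dim) for _ in range(K)]
opt_theta = torch.optim.Adam(master.parameters(), lr=LR_THETA)  # optimizer choice assumed
opt_phi = torch.optim.Adam([p for s in subs for p in s.parameters()], lr=LR_PHI)

# Hierarchical action selection: the master picks a sub-policy index, the chosen
# sub-policy then emits the primitive action.
obs = torch.zeros(obs_dim)
k = master(obs).sample().item()
action = subs[k](obs).sample().item()
```

Per Algorithm 1 as referenced in the Pseudocode and Dataset Splits rows, each sampled task first runs a warmup period in which only the master parameters θ are updated, followed by a joint update period in which both θ and φ are optimized; this is consistent with the quoted rationale that φ uses the much smaller learning rate so it stays relatively consistent within a single warmup and joint-update period.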