META LEARNING SHARED HIERARCHIES
Authors: Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, John Schulman
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on a wide range of environments, including 2D continuous movement, gridworld navigation, and 3D physics tasks involving the directional movement of robots. In the 3D environments, we enable humanoid robots to both walk and crawl with the same policy, and 4-legged robots to discover directional movement primitives to solve a distribution of mazes as well as sparse-reward obstacle courses. Our experiments show that our method is capable of learning meaningful sub-policies solely through interaction with a distribution of tasks, outperforming previously proposed algorithms. |
| Researcher Affiliation | Collaboration | Kevin Frans (Henry M. Gunn High School; work done as an intern at OpenAI), kevinfrans2@gmail.com; Jonathan Ho, Xi Chen, Pieter Abbeel (UC Berkeley, Department of Electrical Engineering and Computer Science); John Schulman (OpenAI) |
| Pseudocode | Yes | Algorithm 1 Meta Learning Shared Hierarchies (a hedged Python skeleton of this loop is given after the table) |
| Open Source Code | No | The paper provides a link for supplemental videos ('Videos at https://sites.google.com/site/mlshsupplementals') but does not state that source code is provided or link to a code repository. |
| Open Datasets | No | The paper describes various environments/tasks used for experiments ('2D moving bandits task', 'four-rooms domain', 'Ant Twowalk', 'Walk/Crawl task', 'Ant Obstacle course task') which appear to be custom-built or based on existing domains without explicit public dataset access information (links, DOIs, formal citations for public datasets). |
| Dataset Splits | No | The paper describes 'warmup period' and 'joint update period' for optimizing policies, but these refer to training phases and not formal training/validation/test splits of a dataset with explicit percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PPO (Schulman et al., 2017)' as the policy gradient method and 'Mujoco (Todorov et al., 2012)' for simulation, but it does not specify version numbers for these or any other software components. |
| Experiment Setup | Yes | For both master and sub-policies, we use 2-layer MLPs with a hidden size of 64. Master policy actions are sampled through a softmax distribution. We train both master and sub-policies using policy gradient methods, specifically PPO (Schulman et al., 2017). For collecting experience, we use a batch size of D = 2000 timesteps. We use a much larger learning rate for θ (0.01) than for φ (0.0003), since the φ parameters should remain relatively consistent throughout a single warmup and joint-update period. Warmup and joint-update lengths for individual environment distributions will be described in the following section. (An illustrative reconstruction of this setup appears after the table.) |
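
The "Pseudocode" row above points to the paper's Algorithm 1. The following is a minimal Python skeleton of that warmup/joint-update structure as described in the paper, not the authors' implementation: the helper callables (`sample_task`, `collect_rollouts`, `ppo_update`) and the `reset_parameters` method are hypothetical placeholders supplied by the caller.

```python
def train_mlsh(master_policy, sub_policies, sample_task, collect_rollouts,
               ppo_update, n_iterations, warmup_steps, joint_steps,
               master_lr=1e-2, sub_lr=3e-4):
    """Skeleton of the MLSH loop: shared sub-policy parameters (phi) persist
    across tasks, while master-policy parameters (theta) are reset for each
    newly sampled task."""
    for _ in range(n_iterations):
        task = sample_task()               # draw a task from the distribution
        master_policy.reset_parameters()   # re-initialize theta for this task (hypothetical method)

        # Warmup period: update only the master policy so it learns to
        # select among the current sub-policies on this task.
        for _ in range(warmup_steps):
            batch = collect_rollouts(task, master_policy, sub_policies)
            ppo_update(master_policy, batch, lr=master_lr)

        # Joint-update period: update master and sub-policies together,
        # with a smaller learning rate on the shared phi parameters.
        for _ in range(joint_steps):
            batch = collect_rollouts(task, master_policy, sub_policies)
            ppo_update(master_policy, batch, lr=master_lr)
            for sub in sub_policies:
                ppo_update(sub, batch, lr=sub_lr)
    return master_policy, sub_policies
```

As the "Dataset Splits" row notes, the warmup and joint-update phases here are training periods within each sampled task, not dataset splits.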
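
The "Experiment Setup" row lists the reported architecture and hyperparameters. Below is a hedged PyTorch reconstruction of those choices (2-layer MLPs with hidden size 64, a softmax master policy over sub-policies, learning rates of 0.01 for θ and 0.0003 for φ); the tanh activations, Adam optimizer, and the illustrative dimensions are assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn

class TwoLayerMLP(nn.Module):
    """2-layer MLP with hidden size 64, per the reported setup (tanh is an assumption)."""
    def __init__(self, obs_dim: int, out_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

obs_dim, act_dim, n_subpolicies = 8, 2, 4   # illustrative sizes, not taken from the paper

# Master policy: outputs logits over sub-policies; actions are sampled from the
# resulting softmax (Categorical) distribution, as stated in the setup.
master = TwoLayerMLP(obs_dim, n_subpolicies)
master_dist = torch.distributions.Categorical(logits=master(torch.zeros(1, obs_dim)))
chosen_sub = master_dist.sample()

# Sub-policies: one 2-layer MLP per primitive; their parameters (phi) are shared across tasks.
sub_policies = [TwoLayerMLP(obs_dim, act_dim) for _ in range(n_subpolicies)]

# Optimizers mirror the reported learning rates: 0.01 for theta (master) and
# 0.0003 for phi (sub-policies); the paper collects batches of D = 2000 timesteps
# per update. Adam is an assumption here; the paper only specifies PPO.
master_opt = torch.optim.Adam(master.parameters(), lr=1e-2)
sub_opts = [torch.optim.Adam(sp.parameters(), lr=3e-4) for sp in sub_policies]
```

The `Categorical(logits=...)` sampling corresponds to the paper's statement that master-policy actions are sampled through a softmax distribution.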