Sub-policy Adaptation for Hierarchical Reinforcement Learning
Authors: Alexander Li, Carlos Florensa, Ignasi Clavera, Pieter Abbeel
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We designed our experiments to answer the following questions: 1) How does HiPPO compare against a flat policy when learning from scratch? 2) Does it lead to policies more robust to environment changes? 3) How well does it adapt already learned skills? and 4) Does our skill diversity assumption hold in practice? |
| Researcher Affiliation | Academia | Alexander C. Li, Carlos Florensa, Ignasi Clavera, Pieter Abbeel; University of California, Berkeley; {alexli1, florensa, iclavera, pabbeel}@berkeley.edu |
| Pseudocode | Yes (see the rollout skeleton below the table) | Algorithm 1 (HiPPO Rollout) and Algorithm 2 (HiPPO) |
| Open Source Code | Yes | Code and videos are available at sites.google.com/view/hippo-rl |
| Open Datasets | No | We evaluate our approach on a variety of robotic locomotion and navigation tasks. The Block environments, depicted in Fig. 2a-2b, have walls of random heights at regular intervals, and the objective is to learn a gait for the Hopper and Half-Cheetah robots to jump over them... The Gather environments, described by Duan et al. (2016), require agents to collect apples (green balls, +1 reward) while avoiding bombs (red balls, -1 reward)... All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Dataset Splits | No | The Block environments used a horizon of 1000 and a batch size of 50,000, while Gather used a batch size of 100,000. |
| Hardware Specification | No | All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Software Dependencies | No | The learning rate, clipping parameter, and number of gradient updates come from the OpenAI Baselines implementation. All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Experiment Setup | Yes (see the configuration sketch below the table) | The Block environments used a horizon of 1000 and a batch size of 50,000, while Gather used a batch size of 100,000. Ant Gather has a horizon of 5000, while Snake Gather has a horizon of 8000 due to its larger size. For all experiments, both PPO and HiPPO used learning rate 3 × 10⁻³, clipping parameter ϵ = 0.1, 10 gradient updates per iteration, and discount γ = 0.999. |
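
The Pseudocode row points to Algorithm 1 (HiPPO Rollout) and Algorithm 2 (HiPPO). As a rough orientation, the snippet below sketches the generic structure such a hierarchical rollout takes: a high-level policy selects a sub-policy (skill) and commits to it for several steps before re-selecting. Everything here (`NUM_SKILLS`, the commitment range, the toy environment, the random stand-in policies) is an illustrative assumption, not the paper's actual Algorithm 1.

```python
# Illustrative skeleton of a hierarchical rollout in the style of Algorithm 1
# (HiPPO Rollout). All policies and the environment are hedged stand-ins.
import random

NUM_SKILLS = 6          # number of low-level sub-policies (assumed)
HORIZON = 1000          # episode horizon, as in the Block environments
MIN_P, MAX_P = 10, 15   # range of time-commitment steps (illustrative)

def high_level_policy(observation):
    """Stand-in manager: pick which sub-policy to run next."""
    return random.randrange(NUM_SKILLS)

def sub_policy(skill_id, observation):
    """Stand-in skill: map an observation to a (dummy) action."""
    return skill_id  # a real skill would output a continuous action

def toy_env_step(action):
    """Dummy environment transition: returns (observation, reward, done)."""
    return random.random(), random.random(), False

def hierarchical_rollout():
    trajectory = []
    obs, t = 0.0, 0
    while t < HORIZON:
        skill = high_level_policy(obs)
        # Commit to the chosen skill for a sampled number of steps.
        commitment = random.randint(MIN_P, MAX_P)
        for _ in range(commitment):
            action = sub_policy(skill, obs)
            obs, reward, done = toy_env_step(action)
            trajectory.append((obs, skill, action, reward))
            t += 1
            if done or t >= HORIZON:
                break
    return trajectory

if __name__ == "__main__":
    print(len(hierarchical_rollout()))
```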
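
The hyperparameters quoted in the Experiment Setup row translate directly into a training configuration. The sketch below is a minimal, hypothetical arrangement of the reported values (learning rate 3 × 10⁻³, clipping ϵ = 0.1, 10 gradient updates per iteration, discount γ = 0.999, and the per-environment horizons and batch sizes), together with the standard PPO clipped surrogate objective those settings plug into. The dictionary layout and the `ppo_clip_loss` helper are assumptions for illustration, not the authors' code.

```python
# Hypothetical reproduction config built from the paper's reported hyperparameters.
import numpy as np

COMMON = {
    "learning_rate": 3e-3,        # reported for both PPO and HiPPO
    "clip_epsilon": 0.1,          # PPO clipping parameter
    "grad_updates_per_iter": 10,
    "discount_gamma": 0.999,
}

ENVS = {
    "Block-Hopper":      {"horizon": 1000, "batch_size": 50_000},
    "Block-HalfCheetah": {"horizon": 1000, "batch_size": 50_000},
    "Ant-Gather":        {"horizon": 5000, "batch_size": 100_000},
    "Snake-Gather":      {"horizon": 8000, "batch_size": 100_000},
}

def ppo_clip_loss(ratio, advantage, clip_epsilon=COMMON["clip_epsilon"]):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantage
    return np.minimum(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy check: clipping caps the incentive for large policy-ratio changes.
    ratio = np.array([0.8, 1.0, 1.3])
    advantage = np.array([1.0, -0.5, 2.0])
    print(ppo_clip_loss(ratio, advantage))
```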