Sub-policy Adaptation for Hierarchical Reinforcement Learning
Authors: Alexander Li, Carlos Florensa, Ignasi Clavera, Pieter Abbeel
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We designed our experiments to answer the following questions: 1) How does HiPPO compare against a flat policy when learning from scratch? 2) Does it lead to policies more robust to environment changes? 3) How well does it adapt already learned skills? and 4) Does our skill diversity assumption hold in practice? |
| Researcher Affiliation | Academia | Alexander C. Li, Carlos Florensa, Ignasi Clavera, Pieter Abbeel; University of California, Berkeley; {alexli1, florensa, iclavera, pabbeel}@berkeley.edu |
| Pseudocode | Yes (see the rollout skeleton below the table) | Algorithm 1 (HiPPO Rollout) and Algorithm 2 (HiPPO) |
| Open Source Code | Yes | Code and videos are available at sites.google.com/view/hippo-rl |
| Open Datasets | No | We evaluate our approach on a variety of robotic locomotion and navigation tasks. The Block environments, depicted in Fig. 2a-2b, have walls of random heights at regular intervals, and the objective is to learn a gait for the Hopper and Half-Cheetah robots to jump over them... The Gather environments, described by Duan et al. (2016), require agents to collect apples (green balls, +1 reward) while avoiding bombs (red balls, -1 reward)... All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Dataset Splits | No | The Block environments used a horizon of 1000 and a batch size of 50,000, while Gather used a batch size of 100,000. |
| Hardware Specification | No | All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Software Dependencies | No | The learning rate, clipping parameter, and number of gradient updates come from the OpenAI Baselines implementation. All environments are simulated with the physics engine MuJoCo (Todorov et al., 2012). |
| Experiment Setup | Yes (see the configuration sketch below the table) | The Block environments used a horizon of 1000 and a batch size of 50,000, while Gather used a batch size of 100,000. Ant Gather has a horizon of 5000, while Snake Gather has a horizon of 8000 due to its larger size. For all experiments, both PPO and HiPPO used learning rate 3 × 10⁻³, clipping parameter ϵ = 0.1, 10 gradient updates per iteration, and discount γ = 0.999. |
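
The Pseudocode row points to Algorithm 1 (HiPPO Rollout) and Algorithm 2 (HiPPO). As a rough orientation, the snippet below sketches the generic structure such a hierarchical rollout takes: a high-level policy selects a sub-policy (skill) and commits to it for several steps before re-selecting. Everything here (`NUM_SKILLS`, the commitment range, the toy environment, the random stand-in policies) is an illustrative assumption, not the paper's actual Algorithm 1.

```python
# Illustrative skeleton of a hierarchical rollout in the style of Algorithm 1
# (HiPPO Rollout). All policies and the environment are hedged stand-ins.
import random

NUM_SKILLS = 6          # number of low-level sub-policies (assumed)
HORIZON = 1000          # episode horizon, as in the Block environments
MIN_P, MAX_P = 10, 15   # range of time-commitment steps (illustrative)

def high_level_policy(observation):
    """Stand-in manager: pick which sub-policy to run next."""
    return random.randrange(NUM_SKILLS)

def sub_policy(skill_id, observation):
    """Stand-in skill: map an observation to a (dummy) action."""
    return skill_id  # a real skill would output a continuous action

def toy_env_step(action):
    """Dummy environment transition: returns (observation, reward, done)."""
    return random.random(), random.random(), False

def hierarchical_rollout():
    trajectory = []
    obs, t = 0.0, 0
    while t < HORIZON:
        skill = high_level_policy(obs)
        # Commit to the chosen skill for a sampled number of steps.
        commitment = random.randint(MIN_P, MAX_P)
        for _ in range(commitment):
            action = sub_policy(skill, obs)
            obs, reward, done = toy_env_step(action)
            trajectory.append((obs, skill, action, reward))
            t += 1
            if done or t >= HORIZON:
                break
    return trajectory

if __name__ == "__main__":
    print(len(hierarchical_rollout()))
```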
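
The hyperparameters quoted in the Experiment Setup row translate directly into a training configuration. The sketch below is a minimal, hypothetical arrangement of the reported values (learning rate 3 × 10⁻³, clipping ϵ = 0.1, 10 gradient updates per iteration, discount γ = 0.999, and the per-environment horizons and batch sizes), together with the standard PPO clipped surrogate objective those settings plug into. The dictionary layout and the `ppo_clip_loss` helper are assumptions for illustration, not the authors' code.

```python
# Hypothetical reproduction config built from the paper's reported hyperparameters.
import numpy as np

COMMON = {
    "learning_rate": 3e-3,        # reported for both PPO and HiPPO
    "clip_epsilon": 0.1,          # PPO clipping parameter
    "grad_updates_per_iter": 10,
    "discount_gamma": 0.999,
}

ENVS = {
    "Block-Hopper":      {"horizon": 1000, "batch_size": 50_000},
    "Block-HalfCheetah": {"horizon": 1000, "batch_size": 50_000},
    "Ant-Gather":        {"horizon": 5000, "batch_size": 100_000},
    "Snake-Gather":      {"horizon": 8000, "batch_size": 100_000},
}

def ppo_clip_loss(ratio, advantage, clip_epsilon=COMMON["clip_epsilon"]):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantage
    return np.minimum(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy check: clipping caps the incentive for large policy-ratio changes.
    ratio = np.array([0.8, 1.0, 1.3])
    advantage = np.array([1.0, -0.5, 2.0])
    print(ppo_clip_loss(ratio, advantage))
```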