Hierarchical RL Using an Ensemble of Proprioceptive Periodic Policies

Authors: Kenneth Marino, Abhinav Gupta, Rob Fergus, Arthur Szlam

Venue: ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our experiments, we test on a variety of difficult sparse reward problems simulated through Mujoco (Todorov et al., 2012). We use two popular and challenging agents: Ant and Humanoid."
Researcher Affiliation | Collaboration | Kenneth Marino & Abhinav Gupta, Carnegie Mellon University and Facebook AI Research ({kdmarino,abhinavg}@cs.cmu.edu); Rob Fergus, New York University and Facebook AI Research (fergus@cs.nyu.edu); Arthur Szlam, Facebook AI Research (aszlam@fb.com)
Pseudocode | Yes | "Algorithm 1 Our method"
Open Source Code | No | The paper links to a project page (https://sites.google.com/view/hrl-ep3), which typically hosts supplementary materials and videos, but it does not state that source code for the method is available there or elsewhere.
Open Datasets | Yes | "In our experiments, we test on a variety of difficult sparse reward problems simulated through Mujoco (Todorov et al., 2012). We use two popular and challenging agents: Ant and Humanoid. ... We compare our method to baselines similar to those used in Haarnoja et al. (2018a), all trained with PPO as is our method. The baseline models are either trained with or without the phase conditioning, and either from scratch, or finetuned (meaning that we initialize the network using a network trained on our low-level objective). We also give some of the baselines more information by also giving them a velocity reward during high-level training (meaning they are rewarded for movement of the agent)."
Dataset Splits | No | The experiments run in interactive reinforcement learning environments rather than on a static dataset, so no train/validation/test splits are specified.
Hardware Specification | No | The paper mentions 'running serially on CPU' when comparing against other methods, but it does not give specific hardware details such as CPU model, number of cores, or GPU specifications.
Software Dependencies | No | The paper mentions using the implementations from Kostrikov (2018) for the RL algorithms (PPO, A2C), its own DQN implementation, and the ADAM optimizer, but it does not provide version numbers for these components or for underlying libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | "The hyperparameters for these three algorithms are shown in Tables 1, 2 and 3. We use the ADAM (Kingma & Ba, 2014) optimizer. ... During low-level training we train 80 policies using different random seeds. ... For our Ant models, we use a 3-layer MLP with tanh activation functions and a hidden size of 32. For Humanoid we add skip connections between layers and decrease the hidden size to 16. ... We choose the cyclic constraint multipliers for state (λs) and action (λa) to be 0.05 and 0.01 respectively." (See the illustrative sketches following this table.)
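
For readers reconstructing the setup, the architecture details quoted in the Experiment Setup row translate into a short sketch. The snippet below is a minimal, hedged PyTorch rendering of the described networks (a 3-layer MLP with tanh activations and hidden size 32 for Ant; hidden size 16 with skip connections between layers for Humanoid); the framework choice, the input/output dimensions, and the exact placement of the skip connections are assumptions, not details taken from the paper.

    import torch.nn as nn

    class AntPolicyMLP(nn.Module):
        """3-layer MLP with tanh activations and hidden size 32, as described for Ant."""
        def __init__(self, obs_dim, act_dim, hidden=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, act_dim),
            )

        def forward(self, obs):
            return self.net(obs)

    class HumanoidPolicyMLP(nn.Module):
        """Hidden size 16 with skip connections between layers, as described for Humanoid."""
        def __init__(self, obs_dim, act_dim, hidden=16):
            super().__init__()
            self.fc1 = nn.Linear(obs_dim, hidden)
            self.fc2 = nn.Linear(hidden, hidden)
            self.out = nn.Linear(hidden, act_dim)
            self.act = nn.Tanh()

        def forward(self, obs):
            h1 = self.act(self.fc1(obs))
            h2 = self.act(self.fc2(h1)) + h1  # assumed residual-style skip connection
            return self.out(h2)

    # Example dimensions for the standard Gym Ant environment; the paper's exact
    # proprioceptive observation space may differ.
    ant_policy = AntPolicyMLP(obs_dim=111, act_dim=8)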
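
The same row reports cyclic constraint multipliers λs = 0.05 and λa = 0.01 for the low-level objective, but the excerpt does not give the constraint's functional form. As a rough illustration only, one plausible form penalizes how far states and actions drift from their values one period earlier; the function below encodes that assumed form and should not be read as the paper's actual objective.

    import numpy as np

    def cyclic_penalty(states, actions, period, lambda_s=0.05, lambda_a=0.01):
        """Assumed cyclic penalty: deviation of the trajectory from itself one
        period ago, weighted by the reported state and action multipliers."""
        s = np.asarray(states)
        a = np.asarray(actions)
        state_term = np.linalg.norm(s[period:] - s[:-period], axis=-1).mean()
        action_term = np.linalg.norm(a[period:] - a[:-period], axis=-1).mean()
        return lambda_s * state_term + lambda_a * action_term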
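
Finally, the paper's title and the note that 80 low-level policies are trained from different random seeds imply a high-level controller choosing among a pretrained ensemble. The loop below sketches that structure under explicit assumptions: the switching interval, the phase period, and the high_level_policy / low-level act interfaces are hypothetical and are not specified in the quoted material.

    def run_episode(env, high_level_policy, low_level_policies,
                    switch_every=10, phase_period=10, max_steps=1000):
        # Hypothetical hierarchical rollout: every `switch_every` steps the high-level
        # policy selects one pretrained low-level policy; low-level policies take a
        # cyclic phase variable (phase conditioning is mentioned for the baselines,
        # but its exact use here is assumed).
        obs = env.reset()
        total_reward, phase, active = 0.0, 0, None
        for t in range(max_steps):
            if t % switch_every == 0:
                active = low_level_policies[high_level_policy.select(obs)]
            obs, reward, done, _ = env.step(active.act(obs, phase))
            total_reward += reward
            phase = (phase + 1) % phase_period
            if done:
                break
        return total_reward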