Hierarchical RL Using an Ensemble of Proprioceptive Periodic Policies
Authors: Kenneth Marino, Abhinav Gupta, Rob Fergus, Arthur Szlam
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we test on a variety of difficult sparse reward problems simulated through Mujoco (Todorov et al., 2012). We use two popular and challenging agents: Ant and Humanoid. (See the environment sketch after the table.) |
| Researcher Affiliation | Collaboration | Kenneth Marino & Abhinav Gupta (Carnegie Mellon University, Facebook AI Research) {kdmarino,abhinavg}@cs.cmu.edu; Rob Fergus (New York University, Facebook AI Research) fergus@cs.nyu.edu; Arthur Szlam (Facebook AI Research) aszlam@fb.com |
| Pseudocode | Yes | Algorithm 1 Our method |
| Open Source Code | No | The paper provides a link to a project page (https://sites.google.com/view/hrl-ep3) which typically hosts supplementary materials and videos, but it does not explicitly state that the source code for the methodology is available at this link or elsewhere. |
| Open Datasets | Yes | In our experiments, we test on a variety of difficult sparse reward problems simulated through Mujoco (Todorov et al., 2012). We use two popular and challenging agents: Ant and Humanoid. ... We compare our method to baselines similar to those used in Haarnoja et al. (2018a), all trained with PPO as is our method. The baseline models are either trained with or without the phase conditioning, and either from scratch, or finetuned (meaning that we initialize the network using a network trained on our low-level objective). We also give some of the baselines more information by also giving them a velocity reward during high-level training (meaning they are rewarded for movement of the agent). |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments and does not specify train/validation/test splits; no static dataset is involved, which is typical for interactive simulation settings. |
| Hardware Specification | No | The paper mentions 'running serially on CPU' when comparing to other methods, but it does not provide specific details on the hardware used, such as CPU model, number of cores, or GPU specifications. |
| Software Dependencies | No | The paper mentions using implementations from Kostrikov (2018) for RL algorithms (PPO, A2C) and its own DQN implementation, as well as the ADAM optimizer. However, it does not provide specific version numbers for these software components or any underlying libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | The hyperparameters for these three algorithms are shown in Tables 1, 2 and 3. We use the ADAM (Kingma & Ba, 2014) optimizer. ... During low-level training we train 80 policies using different random seeds. ... For our Ant models, we use a 3-layer MLP with tanh activation functions and a hidden size of 32. For Humanoid we add skip connections between layers and decrease the hidden size to 16. ... We choose the cyclic constraint multipliers for state (λs) and action (λa) to be 0.05 and 0.01 respectively. (See the policy-network sketch after the table.) |
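
The Experiment Setup row pins down the low-level network architecture well enough to sketch it. Below is a minimal PyTorch sketch, not the authors' code (PyTorch is assumed because the paper builds on Kostrikov (2018)'s implementations); the class name, action head, observation/action dimensions, and learning rate are illustrative assumptions, and phase conditioning and the cyclic-constraint terms are not modeled.

```python
import torch
import torch.nn as nn


class LowLevelPolicyTrunk(nn.Module):
    """Sketch of the policy trunk described in the Experiment Setup row:
    a 3-layer MLP with tanh activations (hidden size 32 for Ant; hidden
    size 16 with skip connections between layers for Humanoid). The class
    name and head structure are assumptions, not the paper's code."""

    def __init__(self, obs_dim, act_dim, hidden_size=32, use_skip=False):
        super().__init__()
        self.use_skip = use_skip
        self.fc1 = nn.Linear(obs_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.action_head = nn.Linear(hidden_size, act_dim)  # action mean

    def forward(self, obs):
        h1 = torch.tanh(self.fc1(obs))
        h2 = torch.tanh(self.fc2(h1))
        if self.use_skip:
            h2 = h2 + h1  # skip connection (Humanoid variant)
        h3 = torch.tanh(self.fc3(h2))
        if self.use_skip:
            h3 = h3 + h2
        return self.action_head(h3)


# Observation/action sizes below are the standard Gym Ant-v2 / Humanoid-v2
# ones, assumed for illustration; the paper's proprioceptive observations
# may differ.
ant_policy = LowLevelPolicyTrunk(obs_dim=111, act_dim=8, hidden_size=32)
humanoid_policy = LowLevelPolicyTrunk(obs_dim=376, act_dim=17,
                                      hidden_size=16, use_skip=True)

# The paper states ADAM is used; the learning rate here is a placeholder,
# since the actual values live in the paper's Tables 1-3.
optimizer = torch.optim.Adam(ant_policy.parameters(), lr=3e-4)
```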
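
The Research Type and Open Datasets rows name only the MuJoCo Ant and Humanoid agents. The snippet below is a hedged sketch that instantiates the standard Gym MuJoCo versions of these agents; the environment IDs and the old (pre-0.26) Gym step/reset API are assumptions, and the paper's sparse-reward task variants built on top of these agents are not reproduced.

```python
import gym  # pre-0.26 Gym API assumed, matching the paper's era

# Standard Gym MuJoCo IDs are assumed for the two agents named in the paper;
# the sparse-reward tasks layered on top of them are not specified in the
# excerpts above and are not reproduced here.
for env_id in ["Ant-v2", "Humanoid-v2"]:
    env = gym.make(env_id)
    obs = env.reset()
    for _ in range(5):  # a few random steps to confirm the simulator runs
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    print(env_id, "obs:", env.observation_space.shape,
          "act:", env.action_space.shape)
    env.close()
```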