Effectively Learning Initiation Sets in Hierarchical Reinforcement Learning
Authors: Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, George Konidaris
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our method learns higher-quality initiation sets faster than existing methods (in MINIGRID and MONTEZUMA'S REVENGE), can automatically discover promising grasps for robot manipulation (in ROBOSUITE), and improves the performance of a state-of-the-art option discovery method in a challenging maze navigation task in MuJoCo. |
| Researcher Affiliation | Collaboration | Akhil Bagaria, Brown University, Providence, RI, USA (akhil_bagaria@brown.edu); Ben Abbatematteo, Brown University, Providence, RI, USA (abba@brown.edu); Omer Gottesman, Amazon, New York, NY, USA (omergott@gmail.com); Matt Corsaro, Brown University, Providence, RI, USA (matthew_corsaro@brown.edu); Sreehari Rammohan, Brown University, Providence, RI, USA (sreehari_rammohan@brown.edu); George Konidaris, Brown University, Providence, RI, USA (gdk@cs.brown.edu) |
| Pseudocode | Yes | Algorithm 1 is the pseudocode used for the experiments described in Section 4.1. The paper also includes Algorithm 2 (Robust DSC Rollout) and Algorithm 3 (Robust DSC Algorithm). |
| Open Source Code | No | The paper does not provide access to the source code for its methodology, nor does it explicitly state that the code has been released. |
| Open Datasets | Yes | MINIGRID-FOURROOMS [Chevalier-Boisvert et al., 2018] and the first screen of MONTEZUMA'S REVENGE [Bellemare et al., 2013]. We use three constrained manipulation tasks in ROBOSUITE [Zhu et al., 2020]. We use the ANT MEDIUM MAZE environment [Fu et al., 2020, Todorov et al., 2012]. |
| Dataset Splits | Yes | The agent is evaluated by rolling out the learned policy once every 10 episodes; during evaluation, the agent starts from a small region around (0, 0), while during training it starts at a location randomly sampled from the open locations in the maze. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF is learned using Fitted Q-Evaluation [Le et al., 2019], prioritized experience replay [Schaul et al., 2016] and target networks [Mnih et al., 2015]. The paper lists software components but does not specify their version numbers. (An illustrative FQE sketch follows this table.) |
| Experiment Setup | Yes | Implementation Details. Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF Q-function and initiation classifier are parameterized using neural networks with the same architecture as the Rainbow/TD3 networks. Each option has a gestation period of 5 [Konidaris and Barto, 2009]. ... Their hyperparameters (Tables 2 and 5) were not tuned and are either identical to the original paper implementation or borrowed from Bagaria et al. [2021a]. The bonus scale c (described in Sec 3.3) was tuned over the set {0.05, 0.1, 0.25, 0.5, 1.0}; the best-performing hyperparameters are listed in Table 3. (An illustrative grid-search sketch also follows this table.) |
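
The Software Dependencies row reports that the initiation value function (IVF) is learned with Fitted Q-Evaluation [Le et al., 2019] and target networks [Mnih et al., 2015], without version numbers. Below is a minimal, purely illustrative PyTorch sketch of an FQE-style regression step for an option's IVF; the network shape, the names `IVFNet` and `fqe_update`, and the transition format are assumptions made here for illustration, not the authors' implementation (which additionally uses prioritized experience replay [Schaul et al., 2016]).

```python
import copy
import torch
import torch.nn as nn

# Hypothetical IVF network: maps a state to an estimate of how likely
# executing the option from that state is to reach its subgoal.
# Layer sizes are illustrative, not taken from the paper.
class IVFNet(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def fqe_update(ivf, target_ivf, optimizer, batch, gamma=0.99):
    """One FQE-style regression step on option transitions.

    `batch` holds (s, s_next, success, done) tensors sampled from an
    option's replay buffer; `success` is 1 when the subgoal was reached.
    """
    s, s_next, success, done = batch
    with torch.no_grad():
        # Bootstrapped target: terminal subgoal indicator, plus discounted
        # value of the next state under the frozen target network.
        target = success + gamma * (1.0 - done) * target_ivf(s_next)
    loss = nn.functional.mse_loss(ivf(s), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random placeholder data (state_dim=4, batch of 32 transitions).
state_dim = 4
ivf = IVFNet(state_dim)
target_ivf = copy.deepcopy(ivf)  # frozen target network [Mnih et al., 2015]
optimizer = torch.optim.Adam(ivf.parameters(), lr=1e-3)
batch = (torch.randn(32, state_dim), torch.randn(32, state_dim),
         torch.randint(0, 2, (32,)).float(), torch.zeros(32))
print(fqe_update(ivf, target_ivf, optimizer, batch))
```

Regressing against a periodically synced frozen copy of the network, rather than the live one, is what stabilizes the bootstrapped target; this mirrors the target-network component cited in the row above.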
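Similarly, the Experiment Setup row states that the bonus scale c was tuned over {0.05, 0.1, 0.25, 0.5, 1.0}. A hedged sketch of such a grid search follows; `run_trial` is a hypothetical stand-in for a full train-and-evaluate run (it returns random scores here), and the number of seeds is an assumption.

```python
import random

# Hypothetical stand-in for a full training run: train an agent with the
# given exploration-bonus scale and return its final evaluation score.
def run_trial(bonus_scale: float, seed: int) -> float:
    random.seed(hash((bonus_scale, seed)) % (2**32))
    return random.random()  # replace with real training + evaluation

BONUS_SCALES = [0.05, 0.1, 0.25, 0.5, 1.0]  # the set reported in the paper
SEEDS = [0, 1, 2]  # seed count is an assumption, not from the paper

scores = {c: sum(run_trial(c, s) for s in SEEDS) / len(SEEDS)
          for c in BONUS_SCALES}
best = max(scores, key=scores.get)
print(f"best bonus scale: {best} (mean score {scores[best]:.3f})")
```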