Effectively Learning Initiation Sets in Hierarchical Reinforcement Learning

Authors: Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, George Konidaris

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that our method learns higher-quality initiation sets faster than existing methods (in MINIGRID and MONTEZUMA'S REVENGE), can automatically discover promising grasps for robot manipulation (in ROBOSUITE), and improves the performance of a state-of-the-art option discovery method in a challenging maze navigation task in MuJoCo.
Researcher Affiliation | Collaboration | Akhil Bagaria, Brown University, Providence, RI, USA (akhil_bagaria@brown.edu); Ben Abbatematteo, Brown University, Providence, RI, USA (abba@brown.edu); Omer Gottesman, Amazon, New York, NY, USA (omergott@gmail.com); Matt Corsaro, Brown University, Providence, RI, USA (matthew_corsaro@brown.edu); Sreehari Rammohan, Brown University, Providence, RI, USA (sreehari_rammohan@brown.edu); George Konidaris, Brown University, Providence, RI, USA (gdk@cs.brown.edu)
Pseudocode | Yes | Algorithm 1 is the pseudocode used for the experiments described in Section 4.1; Algorithm 2 (Robust DSC Rollout) and Algorithm 3 (Robust DSC) are also provided. (A generic option-rollout sketch follows the table.)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper, nor does it explicitly state that the code is released.
Open Datasets | Yes | MINIGRID-FOURROOMS [Chevalier-Boisvert et al., 2018] and the first screen of MONTEZUMA'S REVENGE [Bellemare et al., 2013]. We use three constrained manipulation tasks in ROBOSUITE [Zhu et al., 2020]. We use the ANT MEDIUM MAZE environment [Fu et al., 2020, Todorov et al., 2012]. (An environment-construction sketch follows the table.)
Dataset Splits | Yes | The agent is evaluated by rolling out the learned policy once every 10 episodes; during evaluation, the agent starts from a small region around (0, 0); during training, it starts at a location randomly sampled from the open locations in the maze. (An evaluation-loop sketch follows the table.)
Hardware Specification No The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF is learned using Fitted Q-Evaluation [Le et al., 2019], prioritized experience replay [Schaul et al., 2016] and target networks [Mnih et al., 2015]. The paper lists software components but does not specify their version numbers. (A generic FQE update sketch follows the table.)
Experiment Setup | Yes | Implementation Details. Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF Q-function and initiation classifier are parameterized using neural networks that have the same architecture as the Rainbow/TD3 networks. Each option has a gestation period of 5 [Konidaris and Barto, 2009]. ... Their hyperparameters (Tables 2 and 5) were not tuned and are either identical to the original paper implementation or borrowed from Bagaria et al. [2021a]. The bonus scale c (described in Sec 3.3) was tuned over the set {0.05, 0.1, 0.25, 0.5, 1.0}; the best-performing hyperparameters are listed in Table 3. (A hyperparameter-sweep sketch follows the table.)
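The Pseudocode row only names Algorithms 1-3, so nothing below reproduces Robust DSC itself. For orientation, this is a generic sketch of executing an option only from states that its learned initiation classifier accepts; the `option` interface (`initiation`, `policy`, `termination`) and all other names are hypothetical, not the paper's Algorithm 2.

```python
def rollout_option(env, state, option, max_steps=200):
    """Generic option execution gated by a learned initiation classifier.

    Illustrates the usual options-framework control flow only; `option`
    is assumed to expose initiation(state), policy(state), and
    termination(state). This is not the paper's Robust DSC Rollout.
    """
    if not option.initiation(state):
        return state, False  # option is not available from this state

    for _ in range(max_steps):
        action = option.policy(state)
        state, reward, done, info = env.step(action)
        if done or option.termination(state):
            break
    return state, True
```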
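The Open Datasets row names four publicly available benchmark families. As a rough reproduction aid, the sketch below constructs one environment from each; the package names, environment IDs, the choice of the Door task for ROBOSUITE, and the D4RL maze variant are assumptions, since the excerpt does not specify them.

```python
# Sketch: instantiating the benchmark families named in the paper.
# Assumes gym, gym-minigrid, atari-py/ale-py, robosuite, and d4rl are installed.
import gym
import gym_minigrid  # registers the MiniGrid-* environments
import d4rl          # registers the antmaze-* environments (Fu et al., 2020)
import robosuite

# MiniGrid FourRooms (Chevalier-Boisvert et al., 2018).
fourrooms = gym.make("MiniGrid-FourRooms-v0")

# Montezuma's Revenge (Bellemare et al., 2013); the paper uses the first screen.
montezuma = gym.make("MontezumaRevengeNoFrameskip-v4")

# One robosuite manipulation task (Zhu et al., 2020); the paper uses three
# constrained manipulation tasks whose names are not given in this excerpt,
# so "Door" with a Panda arm is only a placeholder.
door = robosuite.make("Door", robots="Panda", has_renderer=False)

# Ant medium maze via D4RL (Fu et al., 2020) on MuJoCo (Todorov et al., 2012);
# the "-play" variant is an assumption.
ant_maze = gym.make("antmaze-medium-play-v0")
```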
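For the Dataset Splits row, the quoted protocol (evaluate once every 10 episodes; evaluation starts near (0, 0), training starts from a random open maze cell) could look roughly like the loop below. `run_episode` and `sample_near_origin` are caller-supplied hypothetical helpers, not functions from the paper or any library.

```python
import random

def train_and_evaluate(env, agent, open_locations, run_episode,
                       sample_near_origin, num_episodes=1000, eval_every=10):
    """Hypothetical sketch of the quoted maze evaluation protocol.

    `run_episode(env, agent, start_state, explore)` runs one episode from a
    given start state and returns its return; `sample_near_origin()` samples
    a start state from a small region around (0, 0). Both are assumptions.
    """
    eval_returns = []
    for episode in range(num_episodes):
        # Training episodes start from a randomly sampled open maze location.
        start = random.choice(open_locations)
        run_episode(env, agent, start_state=start, explore=True)

        # Every `eval_every` episodes, roll out the learned policy once,
        # starting from a small region around the origin.
        if episode % eval_every == 0:
            eval_start = sample_near_origin()
            eval_returns.append(
                run_episode(env, agent, start_state=eval_start, explore=False))
    return eval_returns
```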
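The Software Dependencies row lists the learning components without versions. To make the Fitted Q-Evaluation piece concrete, here is a generic single-step FQE update with a target network in PyTorch, written for the continuous-action case; the paper's actual IVF architecture, prioritized-replay weighting, and hyperparameters are not given in this excerpt, so everything below is an assumed minimal form.

```python
import torch
import torch.nn.functional as F

def fqe_update(q_net, q_target, optimizer, batch, policy, gamma=0.99):
    """One generic Fitted Q-Evaluation step with a target network.

    `batch` is assumed to hold tensors (states, actions, rewards,
    next_states, dones); `policy` maps next_states to the evaluated
    policy's actions. None of these names come from the paper.
    """
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        next_actions = policy(next_states)
        next_q = q_target(next_states, next_actions)
        targets = rewards + gamma * (1.0 - dones) * next_q

    q_values = q_net(states, actions)
    loss = F.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, q_target):
    # Periodic hard update of the target network (Mnih et al., 2015).
    q_target.load_state_dict(q_net.state_dict())
```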
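The Experiment Setup row states that the bonus scale c was tuned over {0.05, 0.1, 0.25, 0.5, 1.0}. A sweep of that kind might look like the loop below; `train_agent` and the number of seeds are hypothetical, not details from the paper.

```python
BONUS_SCALES = [0.05, 0.1, 0.25, 0.5, 1.0]  # tuning set reported in the paper
NUM_SEEDS = 5                               # number of seeds is an assumption

def sweep_bonus_scale(train_agent):
    """Hypothetical grid search over the exploration-bonus scale c.

    `train_agent(bonus_scale, seed)` is assumed to train one agent and
    return its final performance.
    """
    results = {}
    for c in BONUS_SCALES:
        returns = [train_agent(bonus_scale=c, seed=s) for s in range(NUM_SEEDS)]
        results[c] = sum(returns) / len(returns)
    best_c = max(results, key=results.get)
    return best_c, results
```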