reproducibilityindex.ai

Landmark-Guided Subgoal Generation in Hierarchical Reinforcement Learning

Authors: Junsu Kim, Younggyo Seo, Jinwoo Shin

NeurIPS 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate that our framework outperforms prior-arts across a variety of control tasks, thanks to efficient exploration guided by landmarks.
Researcher Affiliation	Academia	¹Kim Jaechul Graduate School of AI, ²School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode	No	We provide an illustration and an overall description of our framework in Figure 1 and Algorithm ??, respectively.
Open Source Code	Yes	Code is available at https://github.com/junsu-kim97/HIGL
Open Datasets	Yes	We conduct our experiments on a set of challenging long-horizon continuous control tasks based on MuJoCo simulator [48]. Specifically, we consider the following environments to evaluate our framework (see Figure 2 for the visualization of environments). Point Maze [7], Ant Maze (U-shape) [7], Ant Maze (W-shape) [54], Reacher [4], Pusher [4], Stochastic Ant Maze (U-shape) [54].
Dataset Splits	No	The paper does not provide specific training, validation, and test dataset splits with percentages or counts.
Hardware Specification	Yes	All of the experiments were processed using a single GPU (NVIDIA TITAN Xp) and 8 CPU cores (Intel Xeon E5-2630 v4).
Software Dependencies	No	The paper states using the TD3 algorithm [10] but does not list specific software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	For the number of coverage-based landmarks Mcov and the number of novelty-based landmarks Mnov, we use Mcov = 20 and Mnov = 20 in all the environments except Ant Maze (W-shape). We use Mcov = 40 and Mnov = 40 in the more complex Ant Maze (W-shape) environment. In order to avoid the instability in training due to the noisy pseudo-landmark in the early phase of training, we use δpseudo = 0 for the initial 60K timesteps, i.e., k-step adjacent region to the current state instead of pseudo-landmark.
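
The landmark settings quoted in the Experiment Setup row can be written down as a small configuration sketch. This is only an illustration of the quoted values: the class, function, and field names are hypothetical and are not taken from the HIGL repository, the environment string is simply the label used in the table, and timesteps are assumed to be counted from 0. The value of δpseudo after the first 60K timesteps is not given in the quoted text, so the sketch leaves it as a caller-supplied argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandmarkConfig:
    """Hypothetical container; field names are not from the HIGL code base."""
    n_coverage_landmarks: int  # Mcov in the quoted text
    n_novelty_landmarks: int   # Mnov in the quoted text
    pseudo_warmup_steps: int = 60_000  # delta_pseudo is held at 0 before this step


def landmark_config(env_name: str) -> LandmarkConfig:
    # Ant Maze (W-shape) uses 40/40; every other listed environment uses 20/20.
    if env_name == "Ant Maze (W-shape)":
        return LandmarkConfig(40, 40)
    return LandmarkConfig(20, 20)


def delta_pseudo(timestep: int, cfg: LandmarkConfig, delta_after_warmup: float) -> float:
    # The post-warmup value is not stated in the quoted text, so it is an
    # explicit argument here rather than a guessed default.
    # Assumption: timesteps are 0-indexed, so steps 0..59999 are the warmup.
    if timestep < cfg.pseudo_warmup_steps:
        return 0.0
    return delta_after_warmup
```

Keeping the warmup length in the configuration object rather than hard-coding it in the schedule makes it easier to check a rerun against the reported 60K-timestep setting.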