Landmark-Guided Subgoal Generation in Hierarchical Reinforcement Learning

Authors: Junsu Kim, Younggyo Seo, Jinwoo Shin

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that our framework outperforms prior-arts across a variety of control tasks, thanks to efficient exploration guided by landmarks.
Researcher Affiliation | Academia | (1) Kim Jaechul Graduate School of AI; (2) School of Electrical Engineering; Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode | No | We provide an illustration and an overall description of our framework in Figure 1 and Algorithm ??, respectively.
Open Source Code | Yes | Code is available at https://github.com/junsu-kim97/HIGL
Open Datasets | Yes | We conduct our experiments on a set of challenging long-horizon continuous control tasks based on MuJoCo simulator [48]. Specifically, we consider the following environments to evaluate our framework (see Figure 2 for the visualization of environments). Point Maze [7], Ant Maze (U-shape) [7], Ant Maze (W-shape) [54], Reacher [4], Pusher [4], Stochastic Ant Maze (U-shape) [54].
Dataset Splits | No | The paper does not provide specific training, validation, and test dataset splits with percentages or counts.
Hardware Specification | Yes | All of the experiments were processed using a single GPU (NVIDIA TITAN Xp) and 8 CPU cores (Intel Xeon E5-2630 v4).
Software Dependencies | No | The paper states using the TD3 algorithm [10] but does not list specific software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the number of coverage-based landmarks M_cov and the number of novelty-based landmarks M_nov, we use M_cov = 20 and M_nov = 20 in all the environments except Ant Maze (W-shape). We use M_cov = 40 and M_nov = 40 in the more complex Ant Maze (W-shape) environment. In order to avoid the instability in training due to the noisy pseudo-landmark in the early phase of training, we use δ_pseudo = 0 for the initial 60K timesteps, i.e., k-step adjacent region to the current state instead of pseudo-landmark.
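The experiment setup quoted above can be summarized as a small configuration sketch. This is an illustration only and is not taken from the HIGL repository: the names LandmarkConfig, landmark_config, pseudo_landmark_shift, and the environment key "AntMazeW" are hypothetical. The value of δ_pseudo after the warm-up phase is not given in the quoted text, so it is left as a caller-supplied argument. Treating "the initial 60K timesteps" as timesteps 0 through 59,999 is also an assumption.

```python
from dataclasses import dataclass

# Warm-up length quoted in the setup: delta_pseudo is 0 for the initial 60K timesteps.
WARMUP_TIMESTEPS = 60_000


@dataclass(frozen=True)
class LandmarkConfig:
    """Landmark counts (hypothetical container, not the authors' code)."""
    num_coverage_landmarks: int  # M_cov
    num_novelty_landmarks: int   # M_nov


# Default used in all environments except Ant Maze (W-shape).
DEFAULT_CONFIG = LandmarkConfig(num_coverage_landmarks=20, num_novelty_landmarks=20)

# Override for the more complex Ant Maze (W-shape); the key is a hypothetical name.
OVERRIDES = {
    "AntMazeW": LandmarkConfig(num_coverage_landmarks=40, num_novelty_landmarks=40),
}


def landmark_config(env_name: str) -> LandmarkConfig:
    """Return the landmark counts for an environment, falling back to the default."""
    return OVERRIDES.get(env_name, DEFAULT_CONFIG)


def pseudo_landmark_shift(timestep: int, delta_pseudo: float) -> float:
    """Return 0 during warm-up, otherwise the caller-supplied delta_pseudo.

    Assumes timesteps are counted from 0, so warm-up covers 0..59_999.
    """
    return 0.0 if timestep < WARMUP_TIMESTEPS else delta_pseudo
```

For example, landmark_config("Reacher") falls back to the default of 20 and 20 landmarks, while pseudo_landmark_shift(10_000, d) returns 0.0 for any d because timestep 10,000 is still inside the warm-up phase.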