Landmark-Guided Subgoal Generation in Hierarchical Reinforcement Learning
Authors: Junsu Kim, Younggyo Seo, Jinwoo Shin
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that our framework outperforms prior-arts across a variety of control tasks, thanks to efficient exploration guided by landmarks. |
| Researcher Affiliation | Academia | ¹Kim Jaechul Graduate School of AI, ²School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST) |
| Pseudocode | No | We provide an illustration and an overall description of our framework in Figure 1 and Algorithm ??, respectively. |
| Open Source Code | Yes | Code is available at https://github.com/junsu-kim97/HIGL |
| Open Datasets | Yes | We conduct our experiments on a set of challenging long-horizon continuous control tasks based on MuJoCo simulator [48]. Specifically, we consider the following environments to evaluate our framework (see Figure 2 for the visualization of environments). Point Maze [7], Ant Maze (U-shape) [7], Ant Maze (W-shape) [54], Reacher [4], Pusher [4], Stochastic Ant Maze (U-shape) [54]. |
| Dataset Splits | No | The paper does not provide specific training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | Yes | All of the experiments were processed using a single GPU (NVIDIA TITAN Xp) and 8 CPU cores (Intel Xeon E5-2630 v4). |
| Software Dependencies | No | The paper states using the TD3 algorithm [10] but does not list specific software libraries or their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the number of coverage-based landmarks Mcov and the number of novelty-based landmarks Mnov, we use Mcov = 20 and Mnov = 20 in all the environments except Ant Maze (W-shape). We use Mcov = 40 and Mnov = 40 in the more complex Ant Maze (W-shape) environment. In order to avoid the instability in training due to the noisy pseudo-landmark in the early phase of training, we use δpseudo = 0 for the initial 60K timesteps, i.e., k-step adjacent region to the current state instead of pseudo-landmark. |
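
The landmark counts and the δpseudo warm-up quoted in the Experiment Setup row could be encoded roughly as follows. This is a minimal sketch, not the authors' code: the names `LandmarkConfig`, `landmark_config`, `delta_pseudo_at`, and the environment key `"AntMazeW"` are hypothetical. The value of δpseudo after the warm-up is not given in the row, so it is left as a caller-supplied parameter, and treating timesteps as 0-indexed is an assumption.

```python
from dataclasses import dataclass

# From the setup row: delta_pseudo is 0 for the initial 60K timesteps.
PSEUDO_WARMUP_STEPS = 60_000


@dataclass(frozen=True)
class LandmarkConfig:
    """Hypothetical container for the two landmark counts."""
    m_cov: int  # number of coverage-based landmarks
    m_nov: int  # number of novelty-based landmarks


def landmark_config(env_name: str) -> LandmarkConfig:
    """Return landmark counts for an environment.

    "AntMazeW" is an assumed key for Ant Maze (W-shape); every other
    environment uses the default of 20 and 20 stated in the row.
    """
    if env_name == "AntMazeW":
        return LandmarkConfig(m_cov=40, m_nov=40)
    return LandmarkConfig(m_cov=20, m_nov=20)


def delta_pseudo_at(timestep: int, delta_after_warmup: float) -> float:
    """Return delta_pseudo at a given (assumed 0-indexed) timestep.

    During the warm-up the value is 0; afterwards the caller-supplied
    value is returned, since the row does not state it.
    """
    if timestep < PSEUDO_WARMUP_STEPS:
        return 0.0
    return delta_after_warmup
```

With this sketch, `landmark_config("AntMazeW")` yields 40 landmarks of each kind, and `delta_pseudo_at(t, d)` returns 0 for any `t` below 60,000 regardless of `d`.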