Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks
Authors: Sungryull Sohn, Sungtae Lee, Jongwook Choi, Harm H. van Seijen, Mehdi Fatemi, Honglak Lee
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our numerical experiment in a tabular RL setting demonstrates that the SP constraint can significantly reduce the trajectory space of policy. As a result, our constraint enables more sample-efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods including count-based exploration even in continuous control tasks, indicating that it improves the sample efficiency by preventing the agent from taking redundant actions. |
| Researcher Affiliation | Collaboration | ¹University of Michigan, ²LG AI Research, ³Yonsei University, ⁴Microsoft Research. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We used the publicly available codebase (Savinov et al., 2018b) to obtain the baseline results.' but does not provide a link or statement about their own code for the proposed method. |
| Open Datasets | Yes | We evaluate our SPRL on four challenging domains: MiniGrid (Chevalier-Boisvert et al., 2018), DeepMind Lab (Beattie et al., 2016), Atari (Bellemare et al., 2013), and Fetch (Plappert et al., 2018). |
| Dataset Splits | No | The paper does not specify training/validation/test dataset splits or percentages, as would be typical in supervised learning; its experiments are conducted in reinforcement learning environments. |
| Hardware Specification | No | The paper mentions '32 parallel environments' and '12 parallel environments' when comparing against other methods (RND, SIL), implying distributed training, but it does not report specific hardware details such as the GPU or CPU models used for its experiments. |
| Software Dependencies | No | The paper mentions software like 'PPO', 'episodic curiosity (ECO)', 'intrinsic curiosity module (ICM)', and 'GT-Grid', and states 'We used the publicly available codebase (Savinov et al., 2018b) to obtain the baseline results.' However, it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We used the same hyperparameters for all the tasks in a given domain; the details are described in the Appendix. We used the standard domains and tasks for reproducibility. ... See Appendix D, Appendix E, Appendix F, and Appendix G for more details of MiniGrid, DeepMind Lab, Atari, and Fetch, respectively. |