Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks

Authors: Sungryull Sohn, Sungtae Lee, Jongwook Choi, Harm H. van Seijen, Mehdi Fatemi, Honglak Lee

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiment in a tabular RL setting demonstrates that the SP constraint can significantly reduce the trajectory space of the policy. As a result, our constraint enables more sample-efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods, including count-based exploration, even in continuous control tasks, indicating that it improves sample efficiency by preventing the agent from taking redundant actions.
Researcher Affiliation | Collaboration | 1 University of Michigan, 2 LG AI Research, 3 Yonsei University, 4 Microsoft Research.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We used the publicly available codebase (Savinov et al., 2018b) to obtain the baseline results.' but does not provide a link or statement about their own code for the proposed method.
Open Datasets | Yes | We evaluate our SPRL on four challenging domains: MiniGrid (Chevalier-Boisvert et al., 2018), DeepMind Lab (Beattie et al., 2016), Atari (Bellemare et al., 2013), and Fetch (Plappert et al., 2018).
Dataset Splits | No | The paper does not specify training/validation/test dataset splits or percentages in the way a supervised learning paper typically would, as it deals with reinforcement learning environments.
Hardware Specification | No | The paper mentions '32 parallel environments' and '12 parallel environments' in the context of comparisons with other methods (RND, SIL), implying distributed training, but does not provide specific hardware details such as the GPU or CPU models used for its experiments.
Software Dependencies | No | The paper mentions software such as 'PPO', 'episodic curiosity (ECO)', 'intrinsic curiosity module (ICM)', and 'GT-Grid', and states 'We used the publicly available codebase (Savinov et al., 2018b) to obtain the baseline results.' However, it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | We used the same hyperparameters for all the tasks for a given domain; the details are described in the Appendix. We used the standard domains and tasks for reproducibility. ... See Appendix D, Appendix E, Appendix F, and Appendix G for more details on MiniGrid, DeepMind Lab, Atari, and Fetch, respectively.
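
Illustration (not the authors' SPRL code, which is not publicly released per the table above): the toy sketch below shows one way a shortest-path constraint can shrink the trajectory space in a fully known tabular grid world, by masking actions that fail to decrease a BFS-computed distance-to-goal so that only shortest-path-consistent moves remain. The grid layout, the ACTIONS table, and the helper names (bfs_distances, sp_admissible_actions) are hypothetical and chosen only for this example.

# Toy sketch of a shortest-path (SP) action mask in a known tabular grid world.
from collections import deque

GRID = ["....#",
        ".##.#",
        "...G."]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def free_cells(grid):
    # All non-wall cells, including the goal cell 'G'.
    return {(r, c) for r, row in enumerate(grid)
            for c, ch in enumerate(row) if ch != "#"}

def bfs_distances(grid, goal):
    # Shortest-path distance from every reachable free cell to the goal.
    cells = free_cells(grid)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ACTIONS.values():
            nxt = (r + dr, c + dc)
            if nxt in cells and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist

def sp_admissible_actions(state, dist, grid):
    # Keep only actions whose successor strictly decreases distance-to-goal,
    # i.e. moves that stay on some shortest path; everything else is masked.
    cells = free_cells(grid)
    admissible = []
    for name, (dr, dc) in ACTIONS.items():
        nxt = (state[0] + dr, state[1] + dc)
        if nxt in cells and dist.get(nxt, float("inf")) < dist[state]:
            admissible.append(name)
    return admissible

if __name__ == "__main__":
    goal = (2, 3)                      # location of 'G' in GRID
    dist = bfs_distances(GRID, goal)
    print(sp_admissible_actions((0, 0), dist, GRID))  # ['down', 'right']

In this sketch the agent at the top-left corner may only move down or right, since any other action lengthens its path to the goal; this is the "redundant trajectory" pruning effect the abstract describes, shown here with a hand-built mask rather than the paper's learned SP constraint.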