Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards

Authors: Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, Honglak Lee

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima.
Researcher Affiliation | Collaboration | Yijie Guo (1), Jongwook Choi (1), Marcin Moczulski (2), Shengyu Feng (1), Samy Bengio (2), Mohammad Norouzi (2), Honglak Lee (2,1); (1) University of Michigan, (2) Google Brain; {guoyijie,jwook,shengyuf}@umich.edu, moczulski@google.com, {bengio,mnorouzi,honglak}@google.com
Pseudocode | Yes | Pseudocode for organizing clusters is in the appendix, and pseudocode for the state-sampling algorithm is in the appendix as well. (An illustrative sketch of these steps follows the table.)
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We evaluate our method on the hard-exploration games in the Arcade Learning Environment [8, 30].
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. Standard environments are used, but explicit split details are not given in the main text.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using Proximal Policy Optimization [48] as its learning algorithm but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use Proximal Policy Optimization [48] as an actor-critic policy gradient algorithm for our experiments. For each episode, we record u to denote the index of the state in the given demonstration that is last visited by the agent. ... r^im is the imitation reward with a value of 0.1 in our experiments. ... λ is the hyper-parameter controlling the weight of the exploration term. (A reward-shaping sketch follows the table.)
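
The Pseudocode row notes that the routines for organizing clusters and for sampling states to imitate are only given in the paper's appendix. The following is a minimal Python sketch of what such a trajectory memory could look like; the class name `TrajectoryMemory`, the distance threshold, the count-based sampling weights, and the "keep the shorter trajectory" rule are all assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

class TrajectoryMemory:
    """Hypothetical buffer of trajectory endpoints grouped into clusters.

    Each cluster keeps a representative state embedding, the best trajectory
    reaching it, and a visit count; the details here are assumptions.
    """

    def __init__(self, threshold=0.1):
        self.threshold = threshold   # distance below which two embeddings share a cluster
        self.embeddings = []         # representative embedding per cluster
        self.trajectories = []       # best trajectory reaching each cluster
        self.counts = []             # how often each cluster has been visited

    def add(self, embedding, trajectory):
        """Merge a new state into an existing cluster or open a new one."""
        for i, e in enumerate(self.embeddings):
            if np.linalg.norm(np.asarray(embedding) - np.asarray(e)) < self.threshold:
                self.counts[i] += 1
                # assumption: a shorter trajectory is kept as the better demonstration
                if len(trajectory) < len(self.trajectories[i]):
                    self.trajectories[i] = trajectory
                return i
        self.embeddings.append(embedding)
        self.trajectories.append(trajectory)
        self.counts.append(1)
        return len(self.embeddings) - 1

    def sample(self, rng=np.random):
        """Sample a demonstration to imitate, favoring rarely visited clusters."""
        counts = np.asarray(self.counts, dtype=np.float64)
        weights = 1.0 / np.sqrt(counts + 1.0)
        probs = weights / weights.sum()
        idx = rng.choice(len(self.counts), p=probs)
        return self.trajectories[idx]
```

In the paper's setting, the sampled trajectory would then serve as the demonstration on which the trajectory-conditioned policy is conditioned.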
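
The Experiment Setup quote mentions the demonstration index u, the imitation reward r^im = 0.1, and the exploration weight λ. Below is a minimal sketch of how such a shaped reward could be computed; only the 0.1 value is taken from the quoted text, while the look-ahead window, the distance threshold, and the way the exploration bonus enters the sum are assumptions for illustration.

```python
import numpy as np

R_IM = 0.1        # imitation reward r^im quoted from the paper
LAM = 1.0         # hypothetical value for the exploration weight lambda
WINDOW = 10       # hypothetical look-ahead window along the demonstration
THRESHOLD = 0.1   # hypothetical distance threshold for "reaching" a demo state

def shaped_reward(env_reward, exploration_bonus, state_emb, demo_embs, u):
    """Return (total_reward, new_u).

    Adds r^im when the agent's current state embedding is close to a
    demonstration state beyond index u (the last demo state visited so far),
    and mixes in an exploration bonus weighted by lambda.
    """
    imitation = 0.0
    for j in range(u + 1, min(u + 1 + WINDOW, len(demo_embs))):
        if np.linalg.norm(state_emb - demo_embs[j]) < THRESHOLD:
            imitation = R_IM
            u = j  # advance the pointer into the demonstration
            break
    return env_reward + imitation + LAM * exploration_bonus, u

# Example usage with toy embeddings: the agent is near the second demo state,
# so it collects the imitation reward and u advances to 1.
demo = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
r, u = shaped_reward(env_reward=0.0, exploration_bonus=0.0,
                     state_emb=np.array([1.05, 0.0]), demo_embs=demo, u=0)
```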