Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
Authors: Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, Honglak Lee
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima. |
| Researcher Affiliation | Collaboration | Yijie Guo (1), Jongwook Choi (1), Marcin Moczulski (2), Shengyu Feng (1), Samy Bengio (2), Mohammad Norouzi (2), Honglak Lee (2,1); (1) University of Michigan, (2) Google Brain. {guoyijie,jwook,shengyuf}@umich.edu, moczulski@google.com, {bengio,mnorouzi,honglak}@google.com |
| Pseudocode | Yes | Pseudocode for organizing clusters and the pseudocode algorithm for sampling states are both provided in the appendix. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our method on the hard-exploration games in the Arcade Learning Environment [8, 30]. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. While standard environments are used, explicit split details are not in the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using Proximal Policy Optimization [48] as an algorithm but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use Proximal Policy Optimization [48] as an actor-critic policy gradient algorithm for our experiments. For each episode, we record u, the index of the state in the given demonstration most recently visited by the agent. ... r^im is the imitation reward, set to 0.1 in our experiments. ... λ is the hyper-parameter controlling the weight of the exploration term. (See the illustrative sketch below the table.) |
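
The Experiment Setup row quotes a per-step imitation reward r^im = 0.1 granted for revisiting demonstration states, tracked by the index u of the most recently visited demonstration state. The sketch below shows one way such reward shaping could be wired up; the names `states_match` and `shaped_reward`, the state-matching rule, and the lookahead `window` are illustrative assumptions rather than the authors' implementation, and the exploration weight λ from the quote is not modeled here.

```python
import numpy as np

# Minimal sketch of the demonstration-following reward shaping quoted in the
# "Experiment Setup" row. All names, the state-matching rule, and the lookahead
# window are illustrative assumptions, not the authors' implementation. The
# exploration weight lambda mentioned in the paper is not modeled here.

R_IM = 0.1  # imitation reward r^im (value quoted from the paper)


def states_match(state, demo_state, tol=1e-3):
    """Hypothetical matching rule: treat two state embeddings as equal if close."""
    return np.linalg.norm(np.asarray(state) - np.asarray(demo_state)) < tol


def shaped_reward(env_reward, state, demo, u, window=5):
    """Return (total_reward, new_u).

    `u` is the index of the demonstration state most recently visited by the
    agent. If the current state matches a demonstration state within `window`
    steps ahead of `u`, the agent receives the imitation reward and `u` advances.
    """
    r_im, new_u = 0.0, u
    for k in range(u + 1, min(u + 1 + window, len(demo))):
        if states_match(state, demo[k]):
            r_im, new_u = R_IM, k
            break
    return env_reward + r_im, new_u


# Usage sketch with toy state embeddings.
demo = [np.zeros(4), np.full(4, 0.5), np.ones(4)]
reward, u = shaped_reward(env_reward=0.0, state=np.full(4, 0.5), demo=demo, u=0)
print(reward, u)  # -> 0.1 1
```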