Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
Authors: Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, Honglak Lee
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima. |
| Researcher Affiliation | Collaboration | Yijie Guo¹, Jongwook Choi¹, Marcin Moczulski², Shengyu Feng¹, Samy Bengio², Mohammad Norouzi², Honglak Lee²,¹; ¹University of Michigan, ²Google Brain |
| Pseudocode | Yes | Pseudocode for organizing clusters is in the appendix. The pseudo-code algorithm of sampling the states is in the appendix. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our method on the hard-exploration games in the Arcade Learning Environment [8, 30]. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. While standard environments are used, explicit split details are not in the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using Proximal Policy Optimization [48] as an algorithm but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use Proximal Policy Optimization [48] as an actor-critic policy gradient algorithm for our experiments. For each episode, we record u to denote the index of state in the given demonstration that is lastly visited by the agent. ... r^im is the imitation reward with a value of 0.1 in our experiments. ... λ is the hyper-parameter controlling the weight of exploration term. |
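
The Experiment Setup excerpt mentions an index u into the demonstration and an imitation reward of 0.1. The sketch below illustrates one way such bookkeeping could work. It is not the authors' implementation: the function name `imitation_step`, the lookahead `window`, the exact-equality match, and the starting value `u = -1` are all assumptions made for illustration, since the excerpt does not specify them.

```python
from typing import Hashable, Sequence, Tuple

R_IM = 0.1  # imitation reward value quoted in the Experiment Setup row


def imitation_step(
    demo: Sequence[Hashable],
    u: int,
    state: Hashable,
    window: int = 2,  # hypothetical lookahead; not given in the excerpt
) -> Tuple[float, int]:
    """Return (imitation reward, updated u) for one agent step.

    `u` is the index of the demonstration state last visited by the agent.
    """
    # Only demonstration states after index u, within the window, can match.
    upper = min(u + window, len(demo) - 1)
    for i in range(u + 1, upper + 1):
        if demo[i] == state:  # assumed match criterion: exact equality
            return R_IM, i  # reward the agent and advance u
    return 0.0, u  # no match: no imitation reward, u unchanged


# Usage with a toy demonstration of abstract state labels.
demo = ["a", "b", "c", "d"]
u = -1  # assumed start: no demonstration state visited yet
reward, u = imitation_step(demo, u, "a")
```

In this toy run the agent's first state matches `demo[0]`, so it receives the imitation reward and `u` advances to 0. A state that lies beyond the window, or that is absent from the demonstration, yields no reward and leaves `u` where it was.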