Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
Authors: Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, Honglak Lee
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima. |
| Researcher Affiliation | Collaboration | Yijie Guo (1), Jongwook Choi (1), Marcin Moczulski (2), Shengyu Feng (1), Samy Bengio (2), Mohammad Norouzi (2), Honglak Lee (2,1); (1) University of Michigan, (2) Google Brain. {guoyijie,jwook,shengyuf}@umich.edu, moczulski@google.com, {bengio,mnorouzi,honglak}@google.com |
| Pseudocode | Yes | Pseudocode for organizing clusters and the pseudocode algorithm for sampling states are both provided in the appendix. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our method on the hard-exploration games in the Arcade Learning Environment [8, 30]. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. While standard environments are used, explicit split details are not in the main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using Proximal Policy Optimization [48] as an algorithm but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use Proximal Policy Optimization [48] as an actor-critic policy gradient algorithm for our experiments. For each episode, we record u, the index of the state in the given demonstration most recently visited by the agent. ... r^im is the imitation reward, set to 0.1 in our experiments. ... λ is the hyper-parameter controlling the weight of the exploration term. (See the illustrative sketch below the table.) |
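
The Experiment Setup row quotes a per-step imitation reward r^im = 0.1 granted for revisiting demonstration states, tracked by the index u of the most recently visited demonstration state. The sketch below shows one way such reward shaping could be wired up; the names `states_match` and `shaped_reward`, the state-matching rule, and the lookahead `window` are illustrative assumptions rather than the authors' implementation, and the exploration weight λ from the quote is not modeled here.

```python
import numpy as np

# Minimal sketch of the demonstration-following reward shaping quoted in the
# "Experiment Setup" row. All names, the state-matching rule, and the lookahead
# window are illustrative assumptions, not the authors' implementation. The
# exploration weight lambda mentioned in the paper is not modeled here.

R_IM = 0.1  # imitation reward r^im (value quoted from the paper)


def states_match(state, demo_state, tol=1e-3):
    """Hypothetical matching rule: treat two state embeddings as equal if close."""
    return np.linalg.norm(np.asarray(state) - np.asarray(demo_state)) < tol


def shaped_reward(env_reward, state, demo, u, window=5):
    """Return (total_reward, new_u).

    `u` is the index of the demonstration state most recently visited by the
    agent. If the current state matches a demonstration state within `window`
    steps ahead of `u`, the agent receives the imitation reward and `u` advances.
    """
    r_im, new_u = 0.0, u
    for k in range(u + 1, min(u + 1 + window, len(demo))):
        if states_match(state, demo[k]):
            r_im, new_u = R_IM, k
            break
    return env_reward + r_im, new_u


# Usage sketch with toy state embeddings.
demo = [np.zeros(4), np.full(4, 0.5), np.ones(4)]
reward, u = shaped_reward(env_reward=0.0, state=np.full(4, 0.5), demo=demo, u=0)
print(reward, u)  # -> 0.1 1
```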