Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning
Authors: Brett Daley, Martha White, Christopher Amato, Marlos C. Machado
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in several off-policy control tasks. We compare RBIS against Retrace, Truncated IS, and Recursive Retrace when learning this task from off-policy data. |
| Researcher Affiliation | Academia | ¹Department of Computing Science, University of Alberta, Edmonton, AB, Canada; ²Alberta Machine Intelligence Institute; ³Canada CIFAR AI Chair; ⁴Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. |
| Pseudocode | Yes | Algorithm 1 Truncated Importance Sampling and Algorithm 2 Recency-Bounded Importance Sampling (RBIS) are presented in Appendix D. |
| Open Source Code | Yes | Our experiment code is available online: https://github.com/brett-daley/trajectory-aware-etraces |
| Open Datasets | No | The paper uses custom 'Bifurcated Gridworld' environments and does not provide concrete access information for a publicly available or open dataset. |
| Dataset Splits | No | The paper mentions a 'training set of 1,000 trials' for hyperparameter search and a 'separate test set of 1,000 trials' but does not explicitly specify a distinct validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | The target policy used ϵ = 0.1. The behavior policy used a piecewise schedule: ϵ = 1 for the first 5 episodes and then ϵ = 0.2 afterwards. The agents learned from online TD updates with eligibility traces... The initial value function Q was sampled from a zero-mean Gaussian distribution with standard deviation σ = 0.01. We trained each agent for 3,000 timesteps... we searched over λ ∈ {0, 0.1, ..., 1} and α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. |
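
The settings quoted in the Experiment Setup row can be written down as a small configuration sketch. This is not the authors' code (their repository is linked above). All function and constant names are hypothetical. The 0-indexed episode counter, the tie-breaking rule for the greedy action, and the table dimensions in the usage note are assumptions made for illustration only.

```python
import itertools
import random

# Values quoted in the Experiment Setup row.
TARGET_EPSILON = 0.1      # epsilon of the target policy
INIT_STD = 0.01           # sigma of the zero-mean Gaussian used to initialize Q
NUM_TIMESTEPS = 3_000     # training length per agent


def behavior_epsilon(episode: int) -> float:
    """Piecewise schedule: 1.0 for the first 5 episodes, then 0.2.

    Assumption: episodes are counted from 0, so episodes 0..4 are fully random.
    """
    return 1.0 if episode < 5 else 0.2


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over one state's Q-values.

    Assumption: ties for the greedy action go to the lowest action index.
    """
    n = len(q_values)
    greedy = max(range(n), key=q_values.__getitem__)
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def init_q(num_states, num_actions, rng):
    """Sample a tabular Q from a zero-mean Gaussian with std INIT_STD."""
    return [[rng.gauss(0.0, INIT_STD) for _ in range(num_actions)]
            for _ in range(num_states)]


# Hyperparameter grid: lambda in {0, 0.1, ..., 1}, alpha in {0.1, 0.3, 0.5, 0.7, 0.9}.
LAMBDAS = [i / 10 for i in range(11)]
ALPHAS = [0.1, 0.3, 0.5, 0.7, 0.9]
GRID = list(itertools.product(LAMBDAS, ALPHAS))
```

As a usage example, `init_q(10, 4, random.Random(0))` builds a table for a hypothetical environment with 10 states and 4 actions, and `GRID` enumerates the 55 (λ, α) pairs covered by the search.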