Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

Authors: Brett Daley, Martha White, Christopher Amato, Marlos C. Machado

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in several off-policy control tasks. We compare RBIS against Retrace, Truncated IS, and Recursive Retrace when learning this task from off-policy data."
Researcher Affiliation | Academia | (1) Department of Computing Science, University of Alberta, Edmonton, AB, Canada; (2) Alberta Machine Intelligence Institute; (3) Canada CIFAR AI Chair; (4) Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
Pseudocode | Yes | Algorithm 1 (Truncated Importance Sampling) and Algorithm 2 (Recency-Bounded Importance Sampling, RBIS) are presented in Appendix D. (A baseline trace-update sketch follows the table.)
Open Source Code | Yes | Our experiment code is available online: https://github.com/brett-daley/trajectory-aware-etraces
Open Datasets | No | The paper uses custom 'Bifurcated Gridworld' environments and does not provide concrete access information for a publicly available or open dataset.
Dataset Splits | No | The paper mentions a 'training set of 1,000 trials' for hyperparameter search and a 'separate test set of 1,000 trials', but does not explicitly specify a distinct validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., Python 3.8, PyTorch 1.9) needed to replicate the experiments.
Experiment Setup | Yes | "The target policy used ϵ = 0.1. The behavior policy used a piecewise schedule: ϵ = 1 for the first 5 episodes and then ϵ = 0.2 afterwards. The agents learned from online TD updates with eligibility traces... The initial value function Q was sampled from a zero-mean Gaussian distribution with standard deviation σ = 0.01. We trained each agent for 3,000 timesteps... we searched over λ ∈ {0, 0.1, ..., 1} and α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}." (An illustrative configuration sketch follows the table.)
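
For context on the pseudocode row, the sketch below shows the kind of online, off-policy TD(λ) update with eligibility traces that the paper's algorithms build on. It uses the Retrace(λ) trace decay c_t = λ·min(1, ρ_t) from Munos et al. (2016), one of the baselines the paper compares against; the paper's trajectory-aware rules (Truncated IS, RBIS) are given only in its Appendix D and are not reproduced here. The function name, dict-based tabular representation, and transition format are illustrative assumptions, not the authors' implementation.

```python
def retrace_lambda_update(Q, trace, transition, target_policy,
                          gamma=1.0, lam=0.9, alpha=0.5):
    """One online TD update with Retrace(lambda)-style eligibility traces.

    Q, trace      : dicts mapping (state, action) -> value / eligibility
    transition    : (s, a, r, s_next, mu_prob), where mu_prob is the behavior
                    policy's probability of the action actually taken
    target_policy : callable, state -> sequence of action probabilities
    """
    s, a, r, s_next, mu_prob = transition

    # Per-step importance-sampling ratio for the action actually taken.
    rho = target_policy(s)[a] / mu_prob

    # Expected-Sarsa-style TD error under the target policy.
    pi_next = target_policy(s_next)
    v_next = sum(p * Q.get((s_next, b), 0.0) for b, p in enumerate(pi_next))
    delta = r + gamma * v_next - Q.get((s, a), 0.0)

    # Retrace decay c_t = lambda * min(1, rho), then bump the visited pair.
    c = lam * min(1.0, rho)
    for key in list(trace):
        trace[key] *= gamma * c
    trace[(s, a)] = trace.get((s, a), 0.0) + 1.0

    # Credit the TD error to every traced state-action pair.
    for key, e in trace.items():
        Q[key] = Q.get(key, 0.0) + alpha * e * delta
```

Roughly speaking, the paper's trajectory-aware methods replace the per-step scalar decay `c` with quantities that depend on the history of importance-sampling ratios along the trajectory, which is what "trajectory-aware" refers to; the exact rules are in the paper's Appendix D and its released code.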
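
The experiment-setup row quotes concrete values (target ϵ = 0.1, the behavior ϵ schedule, Q initialization with σ = 0.01, 3,000 training timesteps, and the λ/α grids), and the dataset-splits row mentions 1,000 training trials. The minimal sketch below collects those values in one place; the hypothetical `run_trial` callable, the seeding, and the averaging of a scalar error are assumptions for illustration, not details from the paper.

```python
import itertools
import numpy as np

# Values quoted in the setup row above.
TARGET_EPSILON = 0.1                                   # epsilon-greedy target policy
Q_INIT_STD = 0.01                                      # Q ~ N(0, 0.01^2)
TRAIN_TIMESTEPS = 3_000
LAMBDA_GRID = [round(0.1 * i, 1) for i in range(11)]   # {0, 0.1, ..., 1}
ALPHA_GRID = [0.1, 0.3, 0.5, 0.7, 0.9]

def behavior_epsilon(episode):
    """Piecewise behavior-policy schedule: epsilon = 1 for the first
    5 episodes, then 0.2 afterwards."""
    return 1.0 if episode < 5 else 0.2

def initialize_q(states, actions, rng):
    """Zero-mean Gaussian initialization of the tabular value function."""
    return {(s, a): rng.normal(0.0, Q_INIT_STD) for s in states for a in actions}

def hyperparameter_search(run_trial, n_trials=1_000, seed=0):
    """Grid search over (lambda, alpha). `run_trial(lam, alpha, rng)` is a
    hypothetical callable that trains one agent for TRAIN_TIMESTEPS and
    returns a scalar error; the best setting would then be re-evaluated
    on a separate test set of trials."""
    rng = np.random.default_rng(seed)
    results = {}
    for lam, alpha in itertools.product(LAMBDA_GRID, ALPHA_GRID):
        results[(lam, alpha)] = float(np.mean(
            [run_trial(lam, alpha, rng) for _ in range(n_trials)]))
    best = min(results, key=results.get)
    return best, results
```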