Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning
Authors: Brett Daley, Martha White, Christopher Amato, Marlos C. Machado
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in several off-policy control tasks. We compare RBIS against Retrace, Truncated IS, and Recursive Retrace when learning this task from off-policy data. |
| Researcher Affiliation | Academia | 1Department of Computing Science, University of Alberta, Edmonton, AB, Canada 2Alberta Machine Intelligence Institute 3Canada CIFAR AI Chair 4Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. |
| Pseudocode | Yes | Algorithm 1 Truncated Importance Sampling and Algorithm 2 Recency-Bounded Importance Sampling (RBIS) are presented in Appendix D. |
| Open Source Code | Yes | Our experiment code is available online: https://github.com/brett-daley/trajectory-aware-etraces |
| Open Datasets | No | The paper uses custom 'Bifurcated Gridworld' environments and does not provide concrete access information for a publicly available or open dataset. |
| Dataset Splits | No | The paper mentions a 'training set of 1,000 trials' for hyperparameter search and a 'separate test set of 1,000 trials' but does not explicitly specify a distinct validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | The target policy used ϵ = 0.1. The behavior policy used a piecewise schedule: ϵ = 1 for the first 5 episodes and then ϵ = 0.2 afterwards. The agents learned from online TD updates with eligibility traces... The initial value function Q was sampled from a zero-mean Gaussian distribution with standard deviation σ = 0.01. We trained each agent for 3,000 timesteps... we searched over λ ∈ {0, 0.1, ..., 1} and α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. |
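
The Experiment Setup row quotes concrete values (an ϵ-greedy target policy, a piecewise behavior-policy schedule, a Gaussian value-function initialization, training length, and a λ/α grid). The sketch below is a minimal Python illustration of how those quoted settings could be wired together; the gridworld dimensions, the `num_states`/`num_actions` names, and the grid-search loop are illustrative assumptions, not taken from the paper or its released code.

```python
import itertools
import numpy as np

# Values quoted in the Experiment Setup row above; the surrounding
# structure is an illustrative assumption.
TARGET_EPSILON = 0.1          # target policy: epsilon = 0.1
BEHAVIOR_EPSILON_EARLY = 1.0  # behavior policy: epsilon = 1 for the first 5 episodes
BEHAVIOR_EPSILON_LATE = 0.2   # then epsilon = 0.2 afterwards
INIT_STD = 0.01               # Q sampled from a zero-mean Gaussian, sigma = 0.01
TOTAL_TIMESTEPS = 3_000       # training length per agent

LAMBDAS = np.round(np.arange(0.0, 1.01, 0.1), 1)  # lambda in {0, 0.1, ..., 1}
ALPHAS = [0.1, 0.3, 0.5, 0.7, 0.9]                 # step sizes searched


def behavior_epsilon(episode: int) -> float:
    """Piecewise schedule quoted above: 1.0 for the first 5 episodes, 0.2 after."""
    return BEHAVIOR_EPSILON_EARLY if episode < 5 else BEHAVIOR_EPSILON_LATE


def epsilon_greedy_action(Q: np.ndarray, state: int, epsilon: float,
                          rng: np.random.Generator) -> int:
    """Sample an action epsilon-greedily with respect to Q[state]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))


def init_q(num_states: int, num_actions: int,
           rng: np.random.Generator) -> np.ndarray:
    """Zero-mean Gaussian initialization with standard deviation 0.01."""
    return rng.normal(loc=0.0, scale=INIT_STD, size=(num_states, num_actions))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical tabular dimensions; the Bifurcated Gridworld sizes are not given here.
    num_states, num_actions = 64, 4
    for lam, alpha in itertools.product(LAMBDAS, ALPHAS):
        Q = init_q(num_states, num_actions, rng)
        # ... run online off-policy TD updates with eligibility traces for
        # TOTAL_TIMESTEPS steps, acting with epsilon_greedy_action(Q, s,
        # behavior_epsilon(ep), rng) and evaluating under the epsilon = 0.1
        # target policy.
        print(f"lambda={lam}, alpha={alpha}: Q initialized with shape {Q.shape}")
```

The trace-update rules themselves (Retrace, Truncated IS, RBIS) are deliberately left as a comment, since the paper specifies them in Algorithms 1 and 2 of Appendix D and in the released repository rather than in the quoted setup text.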