From Past to Future: Rethinking Eligibility Traces

Authors: Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical analyses, we illustrate how effectively bidirectional value functions can be used in policy evaluation. Our results suggest that the proposed methods can outperform the traditional TD(λ) technique, especially in settings involving complex non-linear approximators. Experiments: We investigate three questions. RQ1: Can we jointly parameterize all three value functions such that learning each of them individually helps to learn the other two? RQ2: Can learning such value functions facilitate/accelerate the process of evaluating forward value functions compared to standard techniques like TD(λ)? RQ3: What is the influence of λ on the method's performance? (A sketch of one possible joint parameterization for RQ1 appears after this table.)
Researcher Affiliation | Collaboration | Dhawal Gupta (1*), Scott M. Jordan (2), Shreyas Chaudhari (1), Bo Liu (3), Philip S. Thomas (1), Bruno Castro da Silva (1); 1: University of Massachusetts, 2: University of Alberta, 3: Amazon
Pseudocode | No | The paper describes update equations mathematically (e.g., Equations 10 and 11) but does not provide structured pseudocode or an algorithm block.
Open Source Code | No | The paper does not provide any statements about releasing code or links to a code repository for the described methodology.
Open Datasets | No | For our prediction problem, we consider a chain domain with 9 states (Sutton and Barto 2018) as depicted in Figure 4. The initial state is drawn from a uniform distribution over the state space. The agent can only take two actions (go left and go right), and the ending states of the chain are terminal. No specific access information for a publicly available dataset is provided. (A minimal implementation sketch of this chain domain appears after the table.)
Dataset Splits | No | Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. No explicit mention of validation or train/validation/test splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not specify any software versions or dependencies used in the experiments.
Experiment Setup | Yes | In our experiments, we sweep over multiple values of the learning rate and λ. We use the same learning rate for all value function heads. Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. We used a single-layer neural network with 9 units in the hidden layer and ReLU as the non-linearity. We evaluate each TD learning algorithm (along with the Monte Carlo variant for learning v) in terms of its ability to approximate the value function of a uniform random policy, π(left|s) = π(right|s) = 0.5, under a discount factor of γ = 0.99. (A hedged sketch of a TD(λ) baseline with this setup appears after the table.)
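
The chain domain in the Open Datasets row is only described in prose. Below is a minimal sketch of how it might be implemented; the class name `ChainEnv`, the reward scheme (+1 on reaching the right terminal state, 0 otherwise), and restricting the uniform initial-state distribution to non-terminal states are assumptions, since the excerpt above does not specify them.

```python
# A minimal sketch of the 9-state chain domain; reward scheme and initial-state
# handling are assumptions, not taken from the paper excerpt.
import numpy as np

class ChainEnv:
    """9-state chain: actions 0 = left, 1 = right; both end states are terminal."""

    def __init__(self, n_states: int = 9, seed: int = 0):
        self.n_states = n_states
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self) -> int:
        # Initial state drawn uniformly; restricted here to non-terminal states,
        # which is one reading of "uniform distribution over the state space".
        self.state = int(self.rng.integers(1, self.n_states - 1))
        return self.state

    def step(self, action: int):
        # Move one state left (action 0) or right (action 1).
        self.state += 1 if action == 1 else -1
        terminal = self.state in (0, self.n_states - 1)
        # Assumed reward: +1 on reaching the right terminal state, 0 otherwise.
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward, terminal
```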
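The Experiment Setup row names the pieces of the TD(λ) baseline (a single hidden layer of 9 ReLU units, a uniform random policy, γ = 0.99) but not how they are assembled. The sketch below shows standard semi-gradient TD(λ) with accumulating eligibility traces over the network parameters, reusing the `ChainEnv` sketch above. PyTorch, the one-hot state encoding, and the fixed α and λ values are assumptions; the paper sweeps α and λ rather than fixing them.

```python
# A hedged sketch of a TD(λ) baseline: semi-gradient TD(λ) with accumulating
# eligibility traces over the parameters of a single-hidden-layer ReLU network
# (9 hidden units), under a uniform random policy with γ = 0.99.
import torch

N_STATES, GAMMA, LAMBDA, ALPHA = 9, 0.99, 0.9, 0.01  # α and λ are swept in the paper

value_net = torch.nn.Sequential(
    torch.nn.Linear(N_STATES, 9),   # 9 hidden units, as stated in the setup row
    torch.nn.ReLU(),
    torch.nn.Linear(9, 1),
)

def one_hot(s: int) -> torch.Tensor:
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

env = ChainEnv()  # from the sketch above
for episode in range(100):
    s, done = env.reset(), False
    traces = [torch.zeros_like(p) for p in value_net.parameters()]  # z_0 = 0
    while not done:
        a = int(torch.rand(1).item() < 0.5)        # π(left|s) = π(right|s) = 0.5
        s_next, r, done = env.step(a)

        v = value_net(one_hot(s))[0]               # v(s; θ)
        with torch.no_grad():
            v_next = 0.0 if done else float(value_net(one_hot(s_next))[0])
        delta = r + GAMMA * v_next - float(v)      # TD error δ

        value_net.zero_grad()
        v.backward()                               # p.grad holds ∇_θ v(s; θ)
        with torch.no_grad():
            for p, z in zip(value_net.parameters(), traces):
                z.mul_(GAMMA * LAMBDA).add_(p.grad)   # z ← γλ z + ∇_θ v(s; θ)
                p.add_(ALPHA * delta * z)             # θ ← θ + α δ z
        s = s_next
```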
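RQ1 asks whether the forward, backward, and bidirectional value functions can be jointly parameterized, and the setup row mentions a shared learning rate "for all value function heads". The sketch below shows one plausible shared-trunk, multi-head parameterization; the head names and the shared-trunk design are assumptions and do not reproduce the paper's Equations 10 and 11.

```python
# One plausible joint parameterization for RQ1: a shared trunk with a separate
# linear head per value function. Head names and the trunk design are assumptions.
import torch

class JointValueNet(torch.nn.Module):
    def __init__(self, n_features: int = 9, n_hidden: int = 9):
        super().__init__()
        self.trunk = torch.nn.Sequential(
            torch.nn.Linear(n_features, n_hidden),
            torch.nn.ReLU(),
        )
        self.forward_head = torch.nn.Linear(n_hidden, 1)        # forward value
        self.backward_head = torch.nn.Linear(n_hidden, 1)       # backward value
        self.bidirectional_head = torch.nn.Linear(n_hidden, 1)  # bidirectional value

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        return (self.forward_head(h),
                self.backward_head(h),
                self.bidirectional_head(h))
```

Under such a design, updating any one head also updates the shared trunk, which is the sense in which learning each value function could help learn the other two.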