From Past to Future: Rethinking Eligibility Traces
Authors: Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical analyses, we illustrate how effectively bidirectional value functions can be used in policy evaluation. Our results suggest that the methods proposed can outperform the traditional TD(λ) technique, especially in settings involving complex non-linear approximators. Experiments: We investigate three questions. RQ1: Can we jointly parameterize all three value functions such that learning each of them individually helps to learn the other two? RQ2: Can learning such value functions facilitate/accelerate the process of evaluating forward value functions compared to standard techniques like TD(λ)? RQ3: What is the influence of λ on the method's performance? |
| Researcher Affiliation | Collaboration | Dhawal Gupta¹*, Scott M. Jordan², Shreyas Chaudhari¹, Bo Liu³, Philip S. Thomas¹, Bruno Castro da Silva¹ (¹University of Massachusetts, ²University of Alberta, ³Amazon) |
| Pseudocode | No | The paper describes the update equations mathematically (e.g., Equations 10 and 11) but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not provide any statements about releasing code or links to a code repository for the described methodology. |
| Open Datasets | No | For our prediction problem, we consider a chain domain with 9 states (Sutton and Barto 2018) as depicted in Figure 4. The initial state is drawn from a uniform distribution over the state space. The agent can only take two actions (go left and go right) and the ending states of the chain are terminal. No specific access information for a publicly available dataset is provided. |
| Dataset Splits | No | Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. No explicit mention of validation or train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software versions or dependencies used in the experiments. |
| Experiment Setup | Yes | In our experiments, we sweep over multiple values of the learning rate (α) and λ. We use the same learning rate for all value function heads. Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. We used a single-layer neural network with 9 units in the hidden layer and ReLU as the non-linearity. We evaluate each TD learning algorithm (along with the Monte Carlo variant for learning v) in terms of its ability to approximate the value function of a uniform random policy, π(left\|s) = π(right\|s) = 0.5, under a discount factor of γ = 0.99. (A hedged sketch of this setup appears after the table.) |
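
Since the paper provides neither pseudocode nor code, the minimal sketch below ties the quoted setup together: the 9-state chain domain with a uniform initial-state distribution and left/right actions, a single-hidden-layer value network with 9 ReLU units, and the semi-gradient TD(λ) baseline evaluated under the uniform random policy with γ = 0.99. The reward structure (+1 at the right terminal, 0 elsewhere), the one-hot state encoding, the weight initialization, and the particular α and λ values are assumptions not stated in the excerpt, and the paper's bidirectional value-function method itself is not implemented here.

```python
# Hedged sketch (not the paper's method): 9-state chain domain plus the
# semi-gradient TD(lambda) baseline with a 9-unit ReLU hidden layer,
# uniform random policy, gamma = 0.99. Rewards, encoding, initialization,
# and the alpha/lambda values below are assumptions.
import numpy as np

N_STATES, GAMMA = 9, 0.99
rng = np.random.default_rng(0)

def chain_step(s, a):
    """One environment step: a=0 moves left, a=1 moves right.
    Stepping off either end of the chain terminates the episode."""
    s_next = s - 1 if a == 0 else s + 1
    if s_next < 0:
        return None, 0.0, True       # left terminal (assumed reward 0)
    if s_next >= N_STATES:
        return None, 1.0, True       # right terminal (assumed reward +1)
    return s_next, 0.0, False

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

# Single-hidden-layer value network: v(s) = w2 . relu(W1 x + b1) + b2.
H = 9
params = {"W1": rng.normal(0.0, 0.1, (H, N_STATES)), "b1": np.zeros(H),
          "w2": rng.normal(0.0, 0.1, H), "b2": np.zeros(())}

def value_and_grads(s):
    """Return v(s) and the gradient of v(s) w.r.t. every parameter."""
    x = one_hot(s)
    pre = params["W1"] @ x + params["b1"]
    h = np.maximum(pre, 0.0)
    v = params["w2"] @ h + params["b2"]
    dpre = params["w2"] * (pre > 0)                          # backprop through ReLU
    return v, {"W1": np.outer(dpre, x), "b1": dpre, "w2": h, "b2": np.ones(())}

def td_lambda(alpha=0.05, lam=0.9, total_steps=50_000):
    """Semi-gradient TD(lambda) with accumulating traces for the uniform
    random policy pi(left|s) = pi(right|s) = 0.5."""
    s = int(rng.integers(N_STATES))                          # uniform initial state
    traces = {k: np.zeros_like(p) for k, p in params.items()}
    for _ in range(total_steps):
        a = int(rng.integers(2))                             # uniform random action
        s_next, r, done = chain_step(s, a)
        v, grads = value_and_grads(s)
        v_next = 0.0 if done else value_and_grads(s_next)[0]
        delta = r + GAMMA * v_next - v                       # TD error
        for k in params:
            traces[k] = GAMMA * lam * traces[k] + grads[k]   # z <- gamma*lambda*z + grad v(s)
            params[k] = params[k] + alpha * delta * traces[k]
        if done:                                             # new episode: resample s, reset traces
            s = int(rng.integers(N_STATES))
            traces = {k: np.zeros_like(p) for k, p in params.items()}
        else:
            s = s_next

td_lambda()
print([round(float(value_and_grads(s)[0]), 3) for s in range(N_STATES)])
```

The accumulating-trace, semi-gradient update used here is the standard TD(λ) formulation from Sutton and Barto (2018), which the paper treats as its baseline; the sweep over α and λ and the averaging over 100 seeds described in the setup row would simply wrap `td_lambda` in the corresponding loops.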