From Past to Future: Rethinking Eligibility Traces
Authors: Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical analyses, we illustrate how effectively bidirectional value functions can be used in policy evaluation. Our results suggest that the methods proposed can outperform the traditional TD(λ) technique, especially in settings involving complex non-linear approximators. Experiments: We investigate three questions. RQ1: Can we jointly parameterize all three value functions such that learning each of them individually helps to learn the other two? RQ2: Can learning such value functions facilitate/accelerate the process of evaluating forward value functions compared to standard techniques like TD(λ)? RQ3: What is the influence of λ on the method's performance? |
| Researcher Affiliation | Collaboration | Dhawal Gupta¹*, Scott M. Jordan², Shreyas Chaudhari¹, Bo Liu³, Philip S. Thomas¹, Bruno Castro da Silva¹ (¹University of Massachusetts, ²University of Alberta, ³Amazon) |
| Pseudocode | No | The paper describes the update equations mathematically (e.g., Equations 10 and 11) but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not provide any statements about releasing code or links to a code repository for the described methodology. |
| Open Datasets | No | For our prediction problem, we consider a chain domain with 9 states (Sutton and Barto 2018) as depicted in Figure 4. The initial state is drawn from a uniform distribution over the state space. The agent can only take two actions (go left and go right) and the ending states of the chain are terminal. No specific access information for a publicly available dataset is provided. |
| Dataset Splits | No | Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. No explicit mention of validation or train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software versions or dependencies used in the experiments. |
| Experiment Setup | Yes | In our experiments, we sweep over multiple values of the learning rate (α) and λ. We use the same learning rate for all value function heads. Each run corresponds to 50K training/environment steps, and we average the loss function over 100 seeds. We used a single-layer neural network with 9 units in the hidden layer and ReLU as the non-linearity. We evaluate each TD learning algorithm (along with the Monte Carlo variant for learning v) in terms of its ability to approximate the value function of a uniform random policy, π(left\|s) = π(right\|s) = 0.5, under a discount factor of γ = 0.99. (A hedged sketch of this setup appears after the table.) |
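
Since the paper provides neither pseudocode nor code, the minimal sketch below ties the quoted setup together: the 9-state chain domain with a uniform initial-state distribution and left/right actions, a single-hidden-layer value network with 9 ReLU units, and the semi-gradient TD(λ) baseline evaluated under the uniform random policy with γ = 0.99. The reward structure (+1 at the right terminal, 0 elsewhere), the one-hot state encoding, the weight initialization, and the particular α and λ values are assumptions not stated in the excerpt, and the paper's bidirectional value-function method itself is not implemented here.

```python
# Hedged sketch (not the paper's method): 9-state chain domain plus the
# semi-gradient TD(lambda) baseline with a 9-unit ReLU hidden layer,
# uniform random policy, gamma = 0.99. Rewards, encoding, initialization,
# and the alpha/lambda values below are assumptions.
import numpy as np

N_STATES, GAMMA = 9, 0.99
rng = np.random.default_rng(0)

def chain_step(s, a):
    """One environment step: a=0 moves left, a=1 moves right.
    Stepping off either end of the chain terminates the episode."""
    s_next = s - 1 if a == 0 else s + 1
    if s_next < 0:
        return None, 0.0, True       # left terminal (assumed reward 0)
    if s_next >= N_STATES:
        return None, 1.0, True       # right terminal (assumed reward +1)
    return s_next, 0.0, False

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

# Single-hidden-layer value network: v(s) = w2 . relu(W1 x + b1) + b2.
H = 9
params = {"W1": rng.normal(0.0, 0.1, (H, N_STATES)), "b1": np.zeros(H),
          "w2": rng.normal(0.0, 0.1, H), "b2": np.zeros(())}

def value_and_grads(s):
    """Return v(s) and the gradient of v(s) w.r.t. every parameter."""
    x = one_hot(s)
    pre = params["W1"] @ x + params["b1"]
    h = np.maximum(pre, 0.0)
    v = params["w2"] @ h + params["b2"]
    dpre = params["w2"] * (pre > 0)                          # backprop through ReLU
    return v, {"W1": np.outer(dpre, x), "b1": dpre, "w2": h, "b2": np.ones(())}

def td_lambda(alpha=0.05, lam=0.9, total_steps=50_000):
    """Semi-gradient TD(lambda) with accumulating traces for the uniform
    random policy pi(left|s) = pi(right|s) = 0.5."""
    s = int(rng.integers(N_STATES))                          # uniform initial state
    traces = {k: np.zeros_like(p) for k, p in params.items()}
    for _ in range(total_steps):
        a = int(rng.integers(2))                             # uniform random action
        s_next, r, done = chain_step(s, a)
        v, grads = value_and_grads(s)
        v_next = 0.0 if done else value_and_grads(s_next)[0]
        delta = r + GAMMA * v_next - v                       # TD error
        for k in params:
            traces[k] = GAMMA * lam * traces[k] + grads[k]   # z <- gamma*lambda*z + grad v(s)
            params[k] = params[k] + alpha * delta * traces[k]
        if done:                                             # new episode: resample s, reset traces
            s = int(rng.integers(N_STATES))
            traces = {k: np.zeros_like(p) for k, p in params.items()}
        else:
            s = s_next

td_lambda()
print([round(float(value_and_grads(s)[0]), 3) for s in range(N_STATES)])
```

The accumulating-trace, semi-gradient update used here is the standard TD(λ) formulation from Sutton and Barto (2018), which the paper treats as its baseline; the sweep over α and λ and the averaging over 100 seeds described in the setup row would simply wrap `td_lambda` in the corresponding loops.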