Source Traces for Temporal Difference Learning
Authors: Silviu Pitis
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 2 (left), which reflects the 3D Gridworld, plots learning curves after 100,000 steps for the source learning algorithm given by equation 4 for S1 (TD(0)) through S4 and S. A similar pattern appears for Sλ with increasing λ. In each case, v0 was initialized to 0, and the error between vn and the true value v was averaged across the MRPs (v was computed by matrix inversion). The curves in Figure 2 (left) are not representative of all learning rates. Figure 2 (center) shows the final error achieved by TD(0), TD(λ) at the best λ (tested in 0.1 increments), S4 and S at various fixed α. All experiments, unless otherwise noted, reflect average results on 30 Random MRP or 3D Gridworld environments. |
| Researcher Affiliation | Academia | Silviu Pitis Georgia Institute of Technology Atlanta, GA, USA 30332 spitis@gatech.edu |
| Pseudocode | Yes | Algorithm 1 Tabular TD learning with source traces |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes generating environments ("Random MRP with 100 states", "1000-state 3D Gridworld") for experiments rather than using a pre-existing, publicly available dataset with concrete access information. No links or citations to specific public datasets are provided. |
| Dataset Splits | No | The paper describes using multiple generated environments (30 Random MRP or 3D Gridworld environments) and averaging results, but does not specify a train/validation/test split for a dataset, nor does it refer to predefined splits with citations or provide cross-validation details. The mention of 'validation' in the paper refers to the process of evaluating the learned value function vn against the true value v. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud resources). |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | We tested the following set of annealing schedules, adapted from Geramifard et al. 2007: αn = α0(N0 + 1)/(N0 + n^1.1) for α0 ∈ {5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 5e-3} and N0 ∈ {0, 1e2, 1e4, 1e6}. In each case, v0 was initialized to 0. β is the learning rate used in the stochastic approximation of S on line 13, and may be fixed or annealed according to some schedule. Starting at λ = 0.5, λ was annealed linearly to 1 over the first 25,000 steps. The replay memory had infinite capacity, and was invoked on every step to replay 3 past steps. |
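The annealing schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the formula αn = α0(N0 + 1)/(N0 + n^1.1); the function name is ours, and the paper specifies only the formula and the two parameter grids.

```python
# Sketch of the learning-rate annealing schedule from the Experiment Setup row:
#   alpha_n = alpha_0 * (N_0 + 1) / (N_0 + n^1.1)
# The function name is illustrative; the paper gives only the formula and grids.

def annealed_alpha(alpha0: float, N0: float, n: int) -> float:
    """Learning rate at step n (n >= 1) for a given alpha0 and N0."""
    return alpha0 * (N0 + 1) / (N0 + n ** 1.1)

# Parameter grids swept in the paper (adapted from Geramifard et al. 2007).
ALPHA0_GRID = [5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 5e-3]
N0_GRID = [0, 1e2, 1e4, 1e6]
```

Note that at n = 1 the schedule yields α1 = α0 exactly, and larger N0 values slow the decay, keeping αn near α0 for longer.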