Source Traces for Temporal Difference Learning

Authors: Silviu Pitis

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 2 (left), which reflects the 3D Gridworld, plots learning curves after 100,000 steps for the source learning algorithm given by equation 4, for S^1 (TD(0)) through S^4 and S. A similar pattern appears for S^λ with increasing λ. In each case, v_0 was initialized to 0, and the error ||v_n − v|| was averaged across the MRPs (v was computed by matrix inversion; see the sketch after this table). The curves in Figure 2 (left) are not representative of all learning rates. Figure 2 (center) shows the final error achieved by TD(0), TD(λ) at the best λ (tested in 0.1 increments), S^4, and S at various fixed α. Unless otherwise noted, all experiments reflect average results on 30 Random MRP or 3D Gridworld environments.
Researcher Affiliation | Academia | Silviu Pitis, Georgia Institute of Technology, Atlanta, GA, USA 30332, spitis@gatech.edu
Pseudocode | Yes | Algorithm 1: Tabular TD learning with source traces (a hedged reconstruction follows after this table).
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper describes generating environments ("Random MRP with 100 states", "1000-state 3D Gridworld") for experiments rather than using a pre-existing, publicly available dataset with concrete access information. No links or citations to specific public datasets are provided.
Dataset Splits | No | The paper describes using multiple generated environments (30 Random MRP or 3D Gridworld environments) and averaging results, but does not specify a train/validation/test split, refer to predefined splits with citations, or provide cross-validation details. The mention of 'validation' in the paper refers to evaluating the learned value function v_n against the true value v.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud resources).
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers.
Experiment Setup | Yes | We tested the following set of annealing schedules, adapted from Geramifard et al. 2007: α_n = α_0(N_0 + 1)/(N_0 + n^1.1) for α_0 ∈ {5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 5e-3} and N_0 ∈ {0, 1e2, 1e4, 1e6}. In each case, v_0 was initialized to 0. β is the learning rate used in the stochastic approximation of S on line 13 of Algorithm 1, and may be fixed or annealed according to some schedule. Starting at λ = 0.5, λ was annealed linearly to 1 over the first 25,000 steps. The replay memory had infinite capacity and was invoked on every step to replay 3 past steps. (Code sketches of these schedules follow after this table.)
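
For reference, the true value v used in the error metric above satisfies the Bellman equation v = r + γPv and can be computed by matrix inversion, as the paper does. A minimal sketch in Python, assuming a tabular MRP with transition matrix P, expected-reward vector r, and discount γ (variable names are illustrative, not from the paper):

```python
import numpy as np

def true_value(P, r, gamma):
    """Solve v = r + gamma * P @ v exactly, i.e., v = (I - gamma * P)^{-1} r."""
    n = P.shape[0]
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P, r)
```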
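
The paper's Algorithm 1 is tabular TD learning with source traces. The sketch below is a reconstruction from the quantities described above, not a transcription of Algorithm 1: the source matrix S ≈ (I − γP)^{-1} is learned by stochastic approximation with learning rate β, and each TD error at state s is credited to every state x in proportion to S[x, s]. The transition-stream interface is an assumption made for illustration.

```python
import numpy as np

def td_source_traces(transitions, n_states, gamma, alpha, beta):
    """Hedged sketch of tabular TD learning with source traces.

    S estimates the source matrix (I - gamma * P)^{-1}, whose (i, j) entry
    is the expected discounted number of visits to state j starting from i.
    """
    v = np.zeros(n_states)   # value estimates, v_0 = 0 as in the experiments
    S = np.eye(n_states)     # running estimate of the source matrix
    I = np.eye(n_states)
    for s, reward, s_next in transitions:          # stream of (s, r, s') steps
        delta = reward + gamma * v[s_next] - v[s]  # one-step TD error at s
        v += alpha * S[:, s] * delta               # credit delta to all sources of s
        # Stochastic approximation of S = I + gamma * P @ S (learning rate beta),
        # corresponding to the beta-update mentioned in the Experiment Setup row:
        S[s] += beta * (I[s] + gamma * S[s_next] - S[s])
    return v, S
```

Terminal transitions (where the γ-bootstrap should be zeroed) and the replay memory described above are omitted for brevity.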
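
The annealing schedules from the Experiment Setup row translate directly to code. A minimal sketch, assuming the step counter n starts at 1:

```python
def alpha_schedule(n, alpha0, N0):
    """Learning-rate schedule adapted from Geramifard et al. 2007:
    alpha_n = alpha0 * (N0 + 1) / (N0 + n**1.1)."""
    return alpha0 * (N0 + 1) / (N0 + n ** 1.1)

def lambda_schedule(n, lam_start=0.5, anneal_steps=25_000):
    """Anneal lambda linearly from lam_start to 1.0 over the first
    anneal_steps steps, then hold at 1.0."""
    return min(1.0, lam_start + (1.0 - lam_start) * n / anneal_steps)

# Grids searched in the paper:
ALPHA0_GRID = [5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 5e-3]
N0_GRID = [0, 1e2, 1e4, 1e6]
```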