Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features
Authors: Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Experiments. We now empirically examine linear TD with linearly dependent features. Following the practice of Sutton and Barto [2018], we use diminishing learning rates $\alpha_t = \frac{\alpha}{(t+t_0)^{\xi}}$ and $\beta_t = c_\beta \alpha_t$, where $\xi \in (0.5, 1]$, $\alpha > 0$, $t_0 > 0$, and $c_\beta > 0$ are constants. We use a variant of Boyan's chain [Boyan, 1999] with 15 states ($\lvert S \rvert = 15$) and 5 actions ($\lvert A \rvert = 5$) under a uniform policy $\pi(a \mid s) = 1/\lvert A \rvert$, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F). The weight convergence to a set is indeed observed. It is within expectation that different λ requires different α, β. Figures 1 and 2 show convergence curves. Each experiment runs for $1.5 \times 10^6$ steps, averaged over 10 runs. |
| Researcher Affiliation | Academia | Zixuan Xie University of Virginia EMAIL; Xinyu Liu University of Virginia EMAIL; Rohan Chandra University of Virginia EMAIL; Shangtong Zhang University of Virginia EMAIL |
| Pseudocode | No | The paper describes the update rules for Linear TD(λ) algorithms (Discounted TD) and (Average Reward TD) using mathematical equations (e.g., '$w_{t+1} = w_t + \alpha_t \left(R_{t+1} + \gamma x(S_{t+1})^\top w_t - x(S_t)^\top w_t\right) e_t$') but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | The code for this paper is available at https://github.com/Wenny Xie/Linear TDLambda. |
| Open Datasets | Yes | We use a variant of Boyan's chain [Boyan, 1999] with 15 states ($\lvert S \rvert = 15$) and 5 actions ($\lvert A \rvert = 5$) under a uniform policy $\pi(a \mid s) = 1/\lvert A \rvert$, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F). |
| Dataset Splits | No | The paper uses a variant of Boyan's chain, a Markov Decision Process (MDP) environment. Experiments involve simulating the environment and generating data through interaction, rather than using predefined training, validation, or test dataset splits. The paper describes the environment setup but does not specify any dataset splits. |
| Hardware Specification | Yes | These experiments were conducted on a server equipped with an AMD EPYC 9534 64-Core Processor, with each run taking approximately 1 minute to complete. Memory requirements are negligible. |
| Software Dependencies | No | The paper does not explicitly state any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Following the practice of Sutton and Barto [2018], we use diminishing learning rates $\alpha_t = \frac{\alpha}{(t+t_0)^{\xi}}$ and $\beta_t = c_\beta \alpha_t$, where $\xi \in (0.5, 1]$, $\alpha > 0$, $t_0 > 0$, and $c_\beta > 0$ are constants. We use a variant of Boyan's chain [Boyan, 1999] with 15 states ($\lvert S \rvert = 15$) and 5 actions ($\lvert A \rvert = 5$) under a uniform policy $\pi(a \mid s) = 1/\lvert A \rvert$, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F). In experiments, for discounted TD (Figure 1), $\gamma = 0.9$, $\alpha_0 \in \{0.005, 0.01\}$. For average reward TD (Figure 2), $\beta_0 = 0.01$, $\alpha_0 \in \{0.01, 0.02, 0.1\}$. |
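
Since the paper gives the discounted TD(λ) update only as equations, the following is a minimal Python sketch of that update combined with the diminishing learning rate quoted above. It is not the authors' code. The function names, the accumulating-trace form of the eligibility vector, the 3-state toy chain, its reward, and all default parameter values are assumptions made for illustration. The toy features are deliberately linearly dependent (the second column is twice the first), loosely mirroring the rank-deficient setting, but this is not the 15-state Boyan's chain variant used in the experiments.

```python
import random


def step_size(alpha, t, t0, xi):
    """Diminishing learning rate alpha_t = alpha / (t + t0) ** xi."""
    return alpha / (t + t0) ** xi


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_lambda_update(w, e, x_s, x_next, reward, gamma, lam, alpha_t):
    """One discounted linear TD(lambda) step; returns (new_w, new_e).

    Assumes an accumulating trace: e_t = gamma * lam * e_{t-1} + x(S_t).
    """
    e = [gamma * lam * ek + xk for ek, xk in zip(e, x_s)]
    # TD error: R_{t+1} + gamma * x(S_{t+1})^T w_t - x(S_t)^T w_t
    delta = reward + gamma * dot(x_next, w) - dot(x_s, w)
    w = [wk + alpha_t * delta * ek for wk, ek in zip(w, e)]
    return w, e


def run_toy_chain(steps=5000, gamma=0.9, lam=0.5,
                  alpha=0.05, t0=10.0, xi=1.0, seed=0):
    """Run TD(lambda) on a hypothetical 3-state chain (illustration only)."""
    rng = random.Random(seed)
    # Hypothetical features: column 2 = 2 * column 1, so the matrix has rank 1.
    features = [[1.0, 2.0], [0.5, 1.0], [0.0, 0.0]]
    w, e = [0.0, 0.0], [0.0, 0.0]
    s = 0
    for t in range(steps):
        s_next = rng.randrange(3)          # assumed uniform random transitions
        reward = 1.0 if s == 2 else 0.0    # assumed reward on leaving state 2
        w, e = td_lambda_update(w, e, features[s], features[s_next],
                                reward, gamma, lam,
                                step_size(alpha, t, t0, xi))
        s = s_next
    return w


if __name__ == "__main__":
    print(run_toy_chain())
```

With linearly dependent features, many weight vectors produce the same value estimates, which is why the quoted text speaks of weights converging to a set rather than to a single point. The sketch only illustrates the mechanics of the update and makes no claim about reproducing the paper's curves.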