Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

Authors: Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 6 Experiments. We now empirically examine linear TD with linearly dependent features. Following the practice of Sutton and Barto [2018], we use diminishing learning rates $\alpha_t = \alpha / (t + t_0)^{\xi}$ and $\beta_t = c_\beta \alpha_t$, where $\xi \in (0.5, 1]$, $\alpha > 0$, $t_0 > 0$, and $c_\beta > 0$ are constants. We use a variant of Boyan's chain [Boyan, 1999] with 15 states (|S| = 15) and 5 actions (|A| = 5) under a uniform policy π(a|s) = 1/|A|, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F). The weight convergence to a set is indeed observed. It is within expectation that different λ requires different α, β. Figures 1 and 2 show convergence curves. Each experiment runs for $1.5 \times 10^6$ steps, averaged over 10 runs.
Researcher Affiliation Academia Zixuan Xie University of Virginia EMAIL; Xinyu Liu University of Virginia EMAIL; Rohan Chandra University of Virginia EMAIL; Shangtong Zhang University of Virginia EMAIL
Pseudocode No The paper describes the update rules for the linear TD(λ) algorithms (Discounted TD and Average Reward TD) using mathematical equations (e.g., '$w_{t+1} = w_t + \alpha_t (R_{t+1} + \gamma x(S_{t+1})^\top w_t - x(S_t)^\top w_t) e_t$') but does not present them in a structured pseudocode or algorithm block format.
Open Source Code Yes The code for this paper is available at https://github.com/WennyXie/LinearTDLambda.
Open Datasets Yes We use a variant of Boyan's chain [Boyan, 1999] with 15 states (|S| = 15) and 5 actions (|A| = 5) under a uniform policy π(a|s) = 1/|A|, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F).
Dataset Splits No The paper uses a variant of Boyan's chain, a Markov Decision Process (MDP) environment. Experiments involve simulating the environment and generating data through interaction, rather than using predefined training, validation, or test dataset splits. The paper describes the environment setup but does not specify any dataset splits.
Hardware Specification Yes These experiments were conducted on a server equipped with an AMD EPYC 9534 64-Core Processor, with each run taking approximately 1 minute to complete. Memory requirements are negligible.
Software Dependencies No The paper does not explicitly state any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup Yes Following the practice of Sutton and Barto [2018], we use diminishing learning rates $\alpha_t = \alpha / (t + t_0)^{\xi}$ and $\beta_t = c_\beta \alpha_t$, where $\xi \in (0.5, 1]$, $\alpha > 0$, $t_0 > 0$, and $c_\beta > 0$ are constants. We use a variant of Boyan's chain [Boyan, 1999] with 15 states (|S| = 15) and 5 actions (|A| = 5) under a uniform policy π(a|s) = 1/|A|, where the feature matrix $X \in \mathbb{R}^{15 \times 5}$ is designed to be of rank 3 (more details in Section F). In experiments, for discounted TD (Figure 1), gamma = 0.9, alpha0 ∈ {0.005, 0.01}. For average reward TD (Figure 2), beta0 = 0.01, alpha0 ∈ {0.01, 0.02, 0.1}.
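
Since the paper is reported to contain no pseudocode, the discounted setup described in the table can be illustrated with a minimal Python sketch. It is not the authors' code. It combines the diminishing step size α_t = α / (t + t_0)^ξ, a 15 × 5 feature matrix of rank 3, and the linear TD(λ) update with an eligibility trace. Everything beyond those three ingredients is an assumption made for illustration: the function names, the toy transition and reward model (which stands in for the Boyan's chain variant), the way the dependent feature columns are built, and the default values of `alpha`, `t0`, `xi`, and `lam`.

```python
import random


def step_size(t, alpha=0.01, t0=100.0, xi=0.75):
    """Diminishing learning rate alpha_t = alpha / (t + t0)^xi (illustrative defaults)."""
    return alpha / (t + t0) ** xi


def make_features(n_states=15, n_features=5, rank=3, seed=0):
    """Hypothetical rank-deficient features: extra columns are sums of base columns."""
    rng = random.Random(seed)
    base = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n_states)]
    extra = n_features - rank
    # Column rank + j equals base column j plus base column j + 1 (indices mod rank),
    # so the extra columns add no new directions and the rank is at most `rank`.
    return [row + [row[j % rank] + row[(j + 1) % rank] for j in range(extra)]
            for row in base]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda(features, n_steps=20000, gamma=0.9, lam=0.5, seed=0):
    """Discounted linear TD(lambda) with an accumulating eligibility trace."""
    rng = random.Random(seed)
    n_states, dim = len(features), len(features[0])
    w = [0.0] * dim  # weight vector w_t
    e = [0.0] * dim  # eligibility trace e_t
    s = 0
    for t in range(n_steps):
        # Toy dynamics (assumption): the next state is drawn uniformly at random,
        # with reward 1 on entering the last state and 0 otherwise.
        s_next = rng.randrange(n_states)
        r = 1.0 if s_next == n_states - 1 else 0.0
        x, x_next = features[s], features[s_next]
        # e_t = gamma * lambda * e_{t-1} + x(S_t)
        e = [gamma * lam * ej + xj for ej, xj in zip(e, x)]
        # TD error: R_{t+1} + gamma * x(S_{t+1})^T w_t - x(S_t)^T w_t
        delta = r + gamma * dot(x_next, w) - dot(x, w)
        a = step_size(t)
        w = [wj + a * delta * ej for wj, ej in zip(w, e)]
        s = s_next
    return w


if __name__ == "__main__":
    X = make_features()
    print(td_lambda(X))
```

Because the feature columns are linearly dependent, many weight vectors produce the same value estimates, which is consistent with the table's note that the weights converge to a set rather than to a single point.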