Target-Based Temporal-Difference Learning

Authors: Donghwan Lee, Niao He

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition, we provide some simulation results showing potentially superior convergence of these target-based TD algorithms compared to the standard TD-learning.
Researcher Affiliation | Academia | 1Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, USA 2Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, USA.
Pseudocode | Yes | Algorithm 1 Standard TD-Learning; Algorithm 2 Averaging TD-Learning (A-TD); Algorithm 3 Double TD-Learning (D-TD); Algorithm 4 Periodic TD-Learning (P-TD).
Open Source Code | No | The paper does not contain any statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper describes a simulated MDP environment with specific parameters ('γ = 0.9, |S| = 10, ... and rπ(s) ∼ U[0, 20]') and a feature vector based on a cited work (Geramifard et al., 2013), but it does not provide concrete access information (e.g., a link, DOI, or specific repository name) for a publicly available or open dataset that was used.
Dataset Splits | No | The paper conducts simulations within a defined MDP environment and evaluates error evolution over iterations, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, sample counts, or predefined dataset partitions.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with their version numbers, such as programming languages, libraries, or specialized solvers used in the experiments.
Experiment Setup | Yes | standard TD-learning ... with the step-size αk = 1000/(k + 10000), and the proposed A-TD ... with the step-size αk = 1000/(k + 10000) and δ = 0.9. ... we employ the adaptive step-size rule βk,t = (10000 · (0.997)^k)/(10000 + t) with Lk = 40 for P-TD, and the corresponding simulation results are given in Figure 3, where P-TD outperforms the standard TD with the step-size αk = 10000/(k + 10000), best tuned for comparison.
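The experiment-setup row above can be sketched in code. The step-size schedules αk = 1000/(k + 10000) and βk,t = 10000·(0.997)^k/(10000 + t) are quoted from the paper; everything else below (the chain-style random MDP, the feature dimension, the Polyak-style averaged target update, and the iteration count) is an illustrative assumption, not the paper's exact A-TD/D-TD/P-TD algorithms.

```python
import numpy as np

# Step-size schedules quoted from the paper's experiment setup.
def alpha(k):
    # alpha_k = 1000 / (k + 10000), used for standard TD and A-TD
    return 1000.0 / (k + 10000.0)

def beta(k, t):
    # beta_{k,t} = 10000 * (0.997)^k / (10000 + t), the adaptive
    # inner-loop rule for P-TD (grouping of terms is an assumption)
    return 10000.0 * (0.997 ** k) / (10000.0 + t)

# Toy target-based TD(0) with linear features on a random 10-state MDP,
# echoing the paper's setting gamma = 0.9, |S| = 10, r_pi(s) ~ U[0, 20].
rng = np.random.default_rng(0)
n_states, dim, gamma = 10, 5, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
r = rng.uniform(0.0, 20.0, n_states)       # rewards drawn from U[0, 20]
Phi = rng.standard_normal((n_states, dim)) # assumed random feature matrix

theta = np.zeros(dim)    # online parameter
target = np.zeros(dim)   # slowly tracking target parameter
delta_mix = 0.9          # delta = 0.9, as quoted for A-TD
s = 0
for k in range(5000):
    s_next = rng.choice(n_states, p=P[s])
    # TD error evaluated against the *target* parameter
    td_err = r[s] + gamma * Phi[s_next] @ target - Phi[s] @ theta
    theta = theta + alpha(k) * td_err * Phi[s]
    # target averages toward the online parameter (assumed update form)
    target = target + delta_mix * alpha(k) * (theta - target)
    s = s_next
```

The target parameter decouples the bootstrap term from the online parameter, which is the common motivation for target-based TD; the averaging update shown is one plausible realization under the stated assumptions.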