Backstepping Temporal Difference Learning

Authors: Han-Dong Lim, Donghwan Lee

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms, including the Baird environment (Baird, 1995), Random Walk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002). The details about the environments are given in Appendix Section 7.7. From the experiments, we see how BTD behaves under different coefficient values in {−0.5, −0.25, 0, 0.25, 0.5}. We measure the Root Mean-Squared Projected Bellman Error (RMSPBE) as the performance metric, and all results are averaged over 100 runs.
Researcher Affiliation | Academia | Han-Dong Lim, Department of Electrical Engineering, KAIST, Daejeon, 34141, South Korea (limaries30@kaist.ac.kr); Donghwan Lee, Department of Electrical Engineering, KAIST, Daejeon, 34141, South Korea (donghwan@kaist.ac.kr)
Pseudocode | Yes | With Algorithm 1 in Appendix, the iterates converge with probability one, as k → ∞, to the fixed point of (6). Consider Algorithm 2 in Appendix. Algorithm 5 in Appendix.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms, including the Baird environment (Baird, 1995), Random Walk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002).
Dataset Splits | No | The paper mentions environments and benchmarks but does not specify training, validation, or test dataset splits, percentages, or methodology for partitioning the data.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or cloud computing instances used for the experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers for replication.
Experiment Setup | Yes | From the experiments, we see how BTD behaves under different coefficient values in {−0.5, −0.25, 0, 0.25, 0.5}. We measure the Root Mean-Squared Projected Bellman Error (RMSPBE) as the performance metric, and all results are averaged over 100 runs. From Table 1: Backstepping TD, step-size = 0.01.
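
Since no source code is released, the RMSPBE metric named in the table can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes linear function approximation and the standard closed form MSPBE(theta) = (b - A theta)^T C^+ (b - A theta), with A = Phi^T D (I - gamma P) Phi, b = Phi^T D r and C = Phi^T D Phi. The names Phi, P, r, d, gamma and theta are illustrative and do not appear in the summary above, and using a pseudo-inverse for C is an assumption made so that linearly dependent features do not break the computation.

```python
import numpy as np


def rmspbe(theta, Phi, P, r, d, gamma):
    """Root mean-squared projected Bellman error for linear value estimates.

    All argument names are hypothetical:
      theta : (n_features,) weight vector being evaluated
      Phi   : (n_states, n_features) feature matrix
      P     : (n_states, n_states) transition matrix under the target policy
      r     : (n_states,) expected one-step reward per state
      d     : (n_states,) state weighting (e.g. behavior stationary distribution)
      gamma : discount factor
    """
    D = np.diag(d)
    n_states = Phi.shape[0]
    A = Phi.T @ D @ (np.eye(n_states) - gamma * P) @ Phi
    b = Phi.T @ D @ r
    C = Phi.T @ D @ Phi
    residual = b - A @ theta
    # Pseudo-inverse (an assumption) keeps this defined when C is singular.
    mspbe = residual @ np.linalg.pinv(C) @ residual
    # Clip tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(mspbe, 0.0)))


if __name__ == "__main__":
    # Toy two-state chain with tabular features, purely for illustration.
    Phi = np.eye(2)
    P = np.array([[0.0, 1.0], [1.0, 0.0]])
    r = np.array([1.0, 0.0])
    d = np.array([0.5, 0.5])
    gamma = 0.9

    # With tabular features the true value function has zero RMSPBE.
    v_true = np.linalg.solve(np.eye(2) - gamma * P, r)
    print(rmspbe(np.zeros(2), Phi, P, r, d, gamma))  # approximately 0.7071
    print(rmspbe(v_true, Phi, P, r, d, gamma))       # approximately 0
```

To mirror the reported protocol, one would evaluate this function on the iterate of each independent run and average the resulting curves over the 100 runs; the step-size and coefficient values would come from the Experiment Setup row.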