Backstepping Temporal Difference Learning
Authors: Han-Dong Lim, Donghwan Lee
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms. These include the Baird environment (Baird, 1995), Random Walk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002). The details of the environments are given in Appendix Section 7.7. In the experiments, we examine how BTD behaves under different coefficient values in {−0.5, −0.25, 0, 0.25, 0.5}. We measure the Root Mean-Squared Projected Bellman Error (RMSPBE) as the performance metric, and all results are averaged over 100 runs. |
| Researcher Affiliation | Academia | Han-Dong Lim, Department of Electrical Engineering, KAIST, Daejeon 34141, South Korea (limaries30@kaist.ac.kr); Donghwan Lee, Department of Electrical Engineering, KAIST, Daejeon 34141, South Korea (donghwan@kaist.ac.kr) |
| Pseudocode | Yes | With Algorithm 1 in the Appendix, the iterate converges to the fixed point of (6) as k → ∞ with probability one. Consider Algorithm 2 in the Appendix. Algorithm 5 in the Appendix. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms. These include the Baird environment (Baird, 1995), Random Walk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002). |
| Dataset Splits | No | The paper mentions environments and benchmarks but does not specify training, validation, or test dataset splits, percentages, or methodology for partitioning the data. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers for replication. |
| Experiment Setup | Yes | In the experiments, we examine how BTD behaves under different coefficient values in {−0.5, −0.25, 0, 0.25, 0.5}. We measure the Root Mean-Squared Projected Bellman Error (RMSPBE) as the performance metric, and all results are averaged over 100 runs. From Table 1: Backstepping TD, step size = 0.01. |
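The RMSPBE metric and the 100-run averaging described in the table can be sketched as follows. This is a minimal illustration that uses the standard linear-function-approximation definition MSPBE(θ) = ‖Π(Tv_θ − v_θ)‖²_D = bᵀC⁻¹b, with b = Φᵀ D δ and C = Φᵀ D Φ. It is not the authors' code. The function names `rmspbe` and `mean_curve`, the argument layout, and the use of a given state distribution `d` as the weighting are hypothetical choices, not details taken from the paper.

```python
import numpy as np


def rmspbe(theta, phi, p, r, gamma, d):
    """Root mean-squared projected Bellman error of the linear estimate phi @ theta.

    Hypothetical argument layout (an assumption, not from the paper):
      phi   -- (n_states, n_features) feature matrix
      p     -- (n_states, n_states) transition matrix under the target policy
      r     -- (n_states,) expected one-step reward under the target policy
      gamma -- discount factor in [0, 1)
      d     -- (n_states,) state-weighting distribution (assumed given)
    """
    big_d = np.diag(d)
    v = phi @ theta
    # Bellman residual T(v) - v = r + gamma * P v - v
    delta = r + gamma * (p @ v) - v
    # b = E_d[delta * phi], c = E_d[phi phi^T]
    b = phi.T @ big_d @ delta
    c = phi.T @ big_d @ phi
    # MSPBE = b^T c^{-1} b, which equals ||Pi (T v - v)||_D^2
    mspbe = float(b @ np.linalg.solve(c, b))
    # Clip tiny negative round-off before taking the square root
    return float(np.sqrt(max(mspbe, 0.0)))


def mean_curve(curves):
    """Average per-run RMSPBE curves (shape: runs x steps) over the runs axis."""
    return np.mean(np.asarray(curves, dtype=float), axis=0)


if __name__ == "__main__":
    # Toy 2-state example with tabular features (hypothetical numbers)
    phi = np.eye(2)
    p = np.full((2, 2), 0.5)
    r = np.array([1.0, 0.0])
    gamma, d = 0.5, np.array([0.5, 0.5])
    v_true = np.linalg.solve(np.eye(2) - gamma * p, r)
    print(rmspbe(np.zeros(2), phi, p, r, gamma, d))  # sqrt(0.5), about 0.7071
    print(rmspbe(v_true, phi, p, r, gamma, d))  # approximately 0
```

In a sweep like the one described, each run would record one RMSPBE value per step, and `mean_curve` would average those curves across runs. Solving with `np.linalg.solve` avoids forming C⁻¹ explicitly.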