Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method
Authors: Ziwei Guan, Tengyu Xu, Yingbin Liang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate the superior performance of PER-ETD and its advantage over ETD. |
| Researcher Affiliation | Academia | Ziwei Guan, Tengyu Xu & Yingbin Liang Department of Electrical and Computer Engineering Ohio State University Columbus, OH 43210, USA |
| Pseudocode | Yes | Algorithm 1 PER-ETD(0), Algorithm 2 PER-ETD(λ) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the methodology described. |
| Open Datasets | Yes | We consider the BAIRD counter-example. The details of the MDP setting and behavior and target policies could be found in Appendix A.1. The BAIRD counter-example is illustrated in Figure 4, which has 7 states and 2 actions... We choose the target policy as π(0|s) = 0.1 and π(1|s) = 0.9 for all states; and choose the behavior policy as µ(0|s) = 6/7 and µ(1|s) = 1/7 for all states. Moreover, we specify the discount factor γ = 0.99. |
| Dataset Splits | No | The paper describes the MDP environment and policies but does not provide specific training/validation/test dataset splits. For this type of reinforcement learning research, data is generated through interaction with the defined environment rather than from pre-collected fixed splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We adopt a constant learning rate for both PER-ETD(0) and PER-ETD(λ) and all experiments take an average over 20 random initialization. We set the stepsize η = 2 9 for all algorithms for fair comparison. For PER-ETD(0), we adopt one-dimensional features Φ1 = (0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.37). |
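
The Open Datasets row quotes state-independent target and behavior policies for the BAIRD counter-example. As a rough illustration of how far apart they are, the sketch below computes the per-action importance sampling ratio ρ(a) = π(a|s) / µ(a|s) from the quoted probabilities. The names `TARGET`, `BEHAVIOR`, `importance_ratio`, `expected_ratio`, and `ratio_variance` are hypothetical and do not come from the paper, and the sketch does not model the MDP's transitions or the PER-ETD updates.

```python
from fractions import Fraction

# Policies as quoted in the table; both are identical across all 7 states.
TARGET = {0: Fraction(1, 10), 1: Fraction(9, 10)}   # pi(a|s)
BEHAVIOR = {0: Fraction(6, 7), 1: Fraction(1, 7)}   # mu(a|s)


def importance_ratio(action):
    """Return rho(a) = pi(a|s) / mu(a|s) for the given action."""
    return TARGET[action] / BEHAVIOR[action]


def expected_ratio():
    """Mean of rho under the behavior policy (equals 1 for valid policies)."""
    return sum(BEHAVIOR[a] * importance_ratio(a) for a in BEHAVIOR)


def ratio_variance():
    """Variance of rho when actions are drawn from the behavior policy."""
    mean = expected_ratio()
    return sum(BEHAVIOR[a] * (importance_ratio(a) - mean) ** 2 for a in BEHAVIOR)


if __name__ == "__main__":
    for a in sorted(TARGET):
        print(f"rho({a}) = {importance_ratio(a)}")
    print("E_mu[rho]   =", expected_ratio())
    print("Var_mu[rho] =", ratio_variance())
```

Exact fractions are used so the unit mean of ρ can be checked without floating-point error. The action favored by the target policy is rare under the behavior policy, so its ratio is large, which is the kind of mismatch off-policy methods such as ETD have to cope with.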