PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Authors: Ziwei Guan, Tengyu Xu, Yingbin Liang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments validate the superior performance of PER-ETD and its advantage over ETD.
Researcher Affiliation | Academia | Ziwei Guan, Tengyu Xu & Yingbin Liang, Department of Electrical and Computer Engineering, Ohio State University, Columbus, OH 43210, USA
Pseudocode | Yes | Algorithm 1 PER-ETD(0), Algorithm 2 PER-ETD(λ) (see the PER-ETD(0) sketch following this table)
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | Yes | We consider the BAIRD counter-example. The details of the MDP setting and behavior and target policies could be found in Appendix A.1. The BAIRD counter-example is illustrated in Figure 4, which has 7 states and 2 actions... We choose the target policy as π(0|s) = 0.1 and π(1|s) = 0.9 for all states; and choose the behavior policy as µ(0|s) = 6/7 and µ(1|s) = 1/7 for all states. Moreover, we specify the discount factor γ = 0.99. (An environment sketch follows this table.)
Dataset Splits | No | The paper describes the MDP environment and policies but does not provide specific training/validation/test dataset splits. For this type of reinforcement learning research, data are generated through interaction with the defined environment rather than drawn from pre-collected fixed splits.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We adopt a constant learning rate for both PER-ETD(0) and PER-ETD(λ), and all experiments take an average over 20 random initializations. We set the stepsize η = 2^-9 for all algorithms for fair comparison. For PER-ETD(0), we adopt one-dimensional features Φ_1 = (0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.37)^T.
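
The Open Datasets row describes the Baird counter-example only in prose. Below is a minimal Python/NumPy sketch of that environment and of the quoted behavior and target policies. Only the policy probabilities and the discount factor γ = 0.99 come from the quoted text; the transition structure (action 0, the "dashed" action, jumping uniformly to the first six states, and action 1, the "solid" action, jumping to the seventh) follows the standard Baird construction and is an assumption here, as are all function and variable names.

```python
# Sketch of the Baird counter-example environment and the quoted policies
# (7 states, 2 actions, gamma = 0.99); the dashed/solid transition structure
# is assumed from the standard Baird construction, not quoted from the paper.
import numpy as np

NUM_STATES = 7
GAMMA = 0.99

# Target policy pi and behavior policy mu, identical across all states:
# pi(0|s) = 0.1, pi(1|s) = 0.9;  mu(0|s) = 6/7, mu(1|s) = 1/7.
PI = np.array([0.1, 0.9])
MU = np.array([6.0 / 7.0, 1.0 / 7.0])


def step(state: int, action: int, rng: np.random.Generator) -> tuple[int, float]:
    """One transition of the (assumed) Baird dynamics; all rewards are zero."""
    if action == 0:                      # "dashed": uniform over the first six states
        next_state = rng.integers(0, 6)
    else:                                # "solid": always the seventh state
        next_state = 6
    return int(next_state), 0.0


def sample_action(rng: np.random.Generator) -> tuple[int, float]:
    """Draw an action from the behavior policy and return its importance ratio pi/mu."""
    action = int(rng.choice(2, p=MU))
    rho = PI[action] / MU[action]
    return action, rho
```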
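Algorithm 1 (PER-ETD(0)) is only named in the Pseudocode row. The sketch below illustrates the core idea behind it, an emphatic TD(0) update whose follow-on trace is restarted at the start of every period, reusing the environment helpers from the previous sketch together with the quoted hyperparameters (stepsize 2^-9, the one-dimensional features, 20 seeds). The trace's restart value, the choice to apply a single parameter update at the end of each period, and the example period length and iteration count are assumptions; consult the paper's Algorithm 1 for the exact pseudocode.

```python
# PER-ETD(0)-style sketch: the follow-on trace is restarted every `period`
# steps, which is the mechanism behind the periodic restart in Algorithm 1.
# Reuses GAMMA, NUM_STATES, step() and sample_action() from the sketch above.
import numpy as np

ETA = 2.0 ** -9                        # constant stepsize from the quoted setup
NUM_SEEDS = 20                         # experiments are averaged over 20 seeds
# Quoted one-dimensional feature for each of the 7 states.
PHI = np.array([[0.35], [0.35], [0.35], [0.35], [0.35], [0.35], [0.37]])


def per_etd0(period: int, num_iterations: int, seed: int) -> np.ndarray:
    """Run `num_iterations` PER-ETD(0)-style iterations and return the weights."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(PHI.shape[1])
    state = int(rng.integers(NUM_STATES))
    for _ in range(num_iterations):
        follow_on = 1.0                # restart the trace each period (assumed restart value)
        for _ in range(period):
            action, rho = sample_action(rng)
            next_state, reward = step(state, action, rng)
            # Standard emphatic TD(0) quantities for this transition.
            td_error = reward + GAMMA * float(PHI[next_state] @ theta) - float(PHI[state] @ theta)
            candidate = ETA * follow_on * rho * td_error * PHI[state]
            follow_on = GAMMA * rho * follow_on + 1.0
            state = next_state
        # Apply one emphatic update per period (a sketch choice, not the paper's exact rule).
        theta = theta + candidate
    return theta


# Example run averaged over 20 seeds; the period and iteration count are placeholders.
weights = np.mean(
    [per_etd0(period=4, num_iterations=5000, seed=s) for s in range(NUM_SEEDS)], axis=0
)
```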