Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Authors: Tengyu Xu, Shaofeng Zou, Yingbin Liang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize." (Abstract) "In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1." (Section 4, Experimental Results) |
| Researcher Affiliation | Academia | Tengyu Xu, Department of Electrical and Computer Engineering, The Ohio State University (xu.3260@osu.edu); Shaofeng Zou, Department of Electrical Engineering, University at Buffalo, The State University of New York (szou3@buffalo.edu); Yingbin Liang, Department of Electrical and Computer Engineering, The Ohio State University (liang.889@osu.edu) |
| Pseudocode | Yes | Algorithm 1, "Blockwise Diminishing Stepsize TDC" (Section 3.3); a hedged sketch of the algorithm's structure appears after this table. |
| Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include a link to a code repository. |
| Open Datasets | No | The paper uses "Garnet problems [1]", a standard procedure for randomly generating Markov decision processes. It specifies the generation parameters (e.g., G(500, 20, 50, 20)) but does not refer to a publicly available dataset accessible via a link, DOI, repository, or formal citation; a hedged generation sketch appears after this table. |
| Dataset Splits | No | The paper conducts experiments in a simulated Markov Decision Process environment using generated problems, but it does not describe specific dataset splits (e.g., percentages, sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not specify any hardware components used for running the experiments (e.g., specific CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as programming languages or libraries with version numbers, used for the experiments. |
| Experiment Setup | Yes | "For all experiments, we choose θ0 = w0 = 0." (Section 4, Experimental Results) "For diminishing stepsize, we set cα = cβ and σ = 3/2ν, and tune their values to the best, which are given by cα = cβ = 1.8, σ = 3/2ν = 0.45. For the four constant-stepsize cases, we fix α for each case, and tune β to the best. The resulting parameter settings are respectively as follows: αt = 0.01, βt = 0.006; αt = 0.02, βt = 0.008; αt = 0.05, βt = 0.02; and αt = 0.1, βt = 0.02." (Section 4.2, Constant Stepsize vs Diminishing Stepsize) |
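
The "Open Datasets" row notes that experiments run on randomly generated Garnet problems rather than a fixed dataset. Below is a minimal sketch of one common Garnet construction, assuming G(500, 20, 50, 20) denotes (number of states, number of actions, branching factor, feature dimension); the paper does not spell out its generator, so the reward distribution, feature matrix, and partition scheme here are all assumptions.

```python
import numpy as np

def make_garnet(n_states=500, n_actions=20, branching=50, n_features=20, seed=0):
    """Randomly generate a Garnet-style MDP with linear features.

    For each (state, action) pair, `branching` distinct next states are drawn
    uniformly, and their transition probabilities come from a random partition
    of [0, 1]. Rewards are uniform on [0, 1]; features are a random binary
    matrix. All of these distributional choices are assumptions.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            cuts = np.sort(rng.uniform(size=branching - 1))
            # Gaps between sorted cut points of [0, 1] sum to 1 by construction.
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.uniform(size=(n_states, n_actions))
    Phi = rng.binomial(1, 0.5, size=(n_states, n_features)).astype(float)
    return P, R, Phi
```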
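The "Pseudocode" and "Experiment Setup" rows reference two time-scale TDC with θ0 = w0 = 0 and the quoted stepsize constants. The following is a minimal sketch of one common form of the off-policy TDC update with linear features, plus the diminishing schedule. The placement of the importance weight ρ and the power-law schedule αt = cα/(1+t)^σ, βt = cβ/(1+t)^ν, with ν = 0.3 obtained by reading σ = (3/2)ν = 0.45, are assumptions rather than a reproduction of Algorithm 1.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One off-policy TDC update with linear features and importance weight rho."""
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    # Slow time-scale: TD update plus gradient-correction term.
    theta = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Fast time-scale: track the auxiliary least-squares solution.
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w

def diminishing_stepsizes(t, c_alpha=1.8, c_beta=1.8, sigma=0.45, nu=0.3):
    """alpha_t = c_alpha/(1+t)^sigma, beta_t = c_beta/(1+t)^nu.

    The power-law form and nu = 0.3 are assumptions about the quoted settings.
    """
    return c_alpha / (1.0 + t) ** sigma, c_beta / (1.0 + t) ** nu
```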
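Algorithm 1 itself is a blockwise diminishing stepsize scheme: stepsizes are held constant within a block and decreased across blocks. The driver below, built on `tdc_step` from the sketch above, only illustrates that structure; `sample_transition` is a hypothetical callable, and the geometric halving of stepsizes and doubling of block lengths are assumptions, not the paper's actual block rules.

```python
import numpy as np

def blockwise_tdc(sample_transition, dim, n_blocks=10, block_len=1000,
                  alpha0=0.1, beta0=0.02, gamma=0.95):
    """Illustrative blockwise-diminishing schedule: stepsizes are constant
    inside each block and halved between blocks, while block lengths double.

    `sample_transition` is a hypothetical callable returning one Markovian
    sample (phi, phi_next, reward, rho).
    """
    theta, w = np.zeros(dim), np.zeros(dim)  # theta0 = w0 = 0, as in the paper
    for s in range(n_blocks):
        alpha, beta = alpha0 / 2 ** s, beta0 / 2 ** s
        for _ in range(block_len * 2 ** s):
            phi, phi_next, reward, rho = sample_transition()
            theta, w = tdc_step(theta, w, phi, phi_next, reward, rho,
                                gamma, alpha, beta)
    return theta
```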