Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Authors: Tengyu Xu, Shaofeng Zou, Yingbin Liang

NeurIPS 2019

Reproducibility assessment. Each entry lists the variable, the assessed result, and the supporting LLM response:
- Research Type: Experimental. LLM response: "Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize." (Abstract) "In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1." (Section 4, Experimental Results)
- Researcher Affiliation: Academia. LLM response: Tengyu Xu, Department of Electrical and Computer Engineering, The Ohio State University (xu.3260@osu.edu); Shaofeng Zou, Department of Electrical Engineering, University at Buffalo, The State University of New York (szou3@buffalo.edu); Yingbin Liang, Department of Electrical and Computer Engineering, The Ohio State University (liang.889@osu.edu).
- Pseudocode: Yes. LLM response: Algorithm 1, Blockwise Diminishing Stepsize TDC (Section 3.3). See the TDC sketch after this list.
- Open Source Code: No. LLM response: The paper does not provide any statement about releasing source code, nor does it include a link to a code repository.
- Open Datasets: No. LLM response: The paper uses Garnet problems [1], a procedure for generating random Markov decision processes, and specifies the generation parameters (e.g., G(500, 20, 50, 20)), but it does not point to a publicly available dataset accessible via a link, DOI, repository, or formal citation. See the Garnet sketch after this list.
- Dataset Splits: No. LLM response: The experiments run in a simulated Markov decision process environment using generated Garnet problems, but the paper does not describe dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
- Hardware Specification: No. LLM response: The paper does not specify the hardware used for the experiments (e.g., CPU/GPU models, memory, or cloud instance types).
- Software Dependencies: No. LLM response: The paper does not list software dependencies (e.g., programming languages or libraries with version numbers) used for the experiments.
- Experiment Setup: Yes. LLM response: "For all experiments, we choose θ_0 = w_0 = 0." (Section 4, Experimental Results) "For diminishing stepsize, we set c_α = c_β and σ = 3/2ν, and tune their values to the best, which are given by c_α = c_β = 1.8, σ = 3/2ν = 0.45. For the four constant-stepsize cases, we fix α for each case, and tune β to the best. The resulting parameter settings are respectively as follows: α_t = 0.01, β_t = 0.006; α_t = 0.02, β_t = 0.008; α_t = 0.05, β_t = 0.02; and α_t = 0.1, β_t = 0.02." (Section 4.2, Constant Stepsize vs. Diminishing Stepsize) See the stepsize sketch after this list.
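
Since Algorithm 1 appears in the paper only as pseudocode, here is a minimal runnable sketch of the two time-scale TDC updates with a blockwise-constant stepsize schedule. The environment interface (env.reset, env.step, env.behavior_action), the feature map phi, the importance ratio rho, and the geometric across-block decay are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def tdc_blockwise(env, phi, rho, dim, num_blocks=10, block_len=1000,
                  alpha0=0.1, beta0=0.05, decay=0.5, gamma=0.95):
    """Sketch of two time-scale TDC with blockwise-constant stepsizes.

    Hypothetical interfaces (not from the paper):
      env.reset() -> initial state; env.step(s, a) -> (next_state, reward);
      env.behavior_action(s) -> action sampled from the behavior policy mu;
      phi(s) -> feature vector in R^dim; rho(s, a) -> pi(a|s) / mu(a|s).
    The geometric across-block decay is an illustrative choice.
    """
    theta = np.zeros(dim)  # slow iterate (value-function weights), theta_0 = 0
    w = np.zeros(dim)      # fast auxiliary iterate, w_0 = 0
    s = env.reset()
    for b in range(num_blocks):
        alpha = alpha0 * decay ** b  # stepsizes held constant within block b,
        beta = beta0 * decay ** b    # shrunk across blocks
        for _ in range(block_len):
            a = env.behavior_action(s)
            s_next, r = env.step(s, a)
            f, f_next = phi(s), phi(s_next)
            delta = r + gamma * (f_next @ theta) - f @ theta  # TD error
            # TDC updates with importance weighting (Sutton et al. 2009 form)
            theta = theta + alpha * rho(s, a) * (delta * f - gamma * (f @ w) * f_next)
            w = w + beta * rho(s, a) * (delta - f @ w) * f
            s = s_next
    return theta, w
```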
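Because a Garnet benchmark is generated procedurally rather than downloaded, the following is a minimal sketch of one common Garnet construction, reading G(500, 20, 50, 20) as (number of states, number of actions, branching factor, feature dimension); that parameter order and the sampling distributions are our assumptions, not necessarily the paper's.

```python
import numpy as np

def make_garnet(n_states=500, n_actions=20, branching=50, feat_dim=20, seed=0):
    """Sketch of a Garnet MDP generator G(n_states, n_actions, branching, feat_dim).

    Each (state, action) pair transitions to 'branching' uniformly chosen
    successor states with random probabilities; rewards and features are
    drawn at random. These distributions are common choices, not necessarily
    the ones used in the paper.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))  # transition kernel
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            P[s, a, succ] = rng.dirichlet(np.ones(branching))  # random next-state probs
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))   # random rewards
    Phi = rng.uniform(0.0, 1.0, size=(n_states, feat_dim))  # linear features
    return P, R, Phi
```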
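The quoted constants fit the standard polynomial schedules α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν; that functional form is our assumption, since the quote gives only the constants. With σ = (3/2)ν = 0.45, it follows that ν = 0.30. A minimal sketch of the two stepsize regimes:

```python
def diminishing_stepsizes(t, c_alpha=1.8, c_beta=1.8, sigma=0.45, nu=0.30):
    # Assumed polynomial form: alpha_t = c_alpha / (1 + t)**sigma and
    # beta_t = c_beta / (1 + t)**nu. The constraint sigma = (3/2) * nu = 0.45
    # quoted above gives nu = 0.30; the functional form is our assumption,
    # the tuned constants are the ones reported in Section 4.2.
    return c_alpha / (1 + t) ** sigma, c_beta / (1 + t) ** nu

def constant_stepsizes(alpha=0.01, beta=0.006):
    # One of the four constant-stepsize settings reported in Section 4.2.
    return alpha, beta
```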