Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Authors: Tengyu Xu, Shaofeng Zou, Yingbin Liang

NeurIPS 2019

Reproducibility Variable Result LLM Response
Research Type | Experimental | "Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize." (Abstract) "In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1." (Section 4, Experimental Results)
Researcher Affiliation | Academia | Tengyu Xu (Department of Electrical and Computer Engineering, The Ohio State University); Shaofeng Zou (Department of Electrical Engineering, University at Buffalo, The State University of New York); Yingbin Liang (Department of Electrical and Computer Engineering, The Ohio State University)
Pseudocode | Yes | Algorithm 1, "Blockwise Diminishing Stepsize TDC" (Section 3.3)
Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include a link to a code repository.
Open Datasets | No | The paper uses "Garnet problems [1]", a method for generating random Markov decision processes. It specifies parameters for this generation (e.g., G(500, 20, 50, 20)) but does not refer to a publicly available or open dataset accessible via a link, DOI, repository, or formal citation.
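Since the evaluation environment is generated rather than loaded from a dataset, the entry above can be made concrete with a sketch of a Garnet-style random-MDP generator. The argument names and their mapping onto the paper's G(500, 20, 50, 20) notation are assumptions; the report does not spell out the parameterization.

```python
import numpy as np

def garnet(n_states, n_actions, branching, seed=0):
    """Sketch of a Garnet-style random-MDP generator (hypothetical
    parameterization). Each (state, action) pair transitions to
    `branching` distinct, uniformly chosen next states with
    Dirichlet-random probabilities; rewards are uniform on [0, 1].
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))  # transition kernel
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[s, a, nxt] = rng.dirichlet(np.ones(branching))
    R = rng.uniform(size=(n_states, n_actions))    # reward table
    return P, R
```

Because the MDP is fully specified by the sampled (P, R), reproducing such experiments depends on the generator parameters and random seed rather than on any downloadable dataset.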
Dataset Splits | No | The experiments run in a simulated Markov decision process environment using generated problems; the paper does not describe dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., CPU/GPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not list software dependencies, such as programming languages or libraries with version numbers, used for the experiments.
Experiment Setup | Yes | "For all experiments, we choose θ0 = w0 = 0." (Section 4, Experimental Results) "For diminishing stepsize, we set cα = cβ and σ = 3/2ν, and tune their values to the best, which are given by cα = cβ = 1.8, σ = 3/2ν = 0.45. For the four constant-stepsize cases, we fix α for each case, and tune β to the best. The resulting parameter settings are respectively as follows: αt = 0.01, βt = 0.006; αt = 0.02, βt = 0.008; αt = 0.05, βt = 0.02; and αt = 0.1, βt = 0.02." (Section 4.2, Constant Stepsize vs Diminishing Stepsize)
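The two-time-scale updates these stepsizes drive can be sketched as follows. This is the standard TDC (TD with gradient correction) update under linear function approximation, not necessarily the authors' exact implementation, and the diminishing-stepsize form α_t = c_α/(1+t)^σ is an assumed parameterization consistent with the constants quoted above (c_α = c_β = 1.8, σ = 0.45, ν = 0.3).

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One TDC update: theta is the value-function weight vector
    (fast parameter of interest), w is the auxiliary correction
    vector updated on the second time scale."""
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    theta_new = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

def diminishing_stepsizes(t, c_alpha=1.8, c_beta=1.8, sigma=0.45, nu=0.3):
    """Assumed schedule alpha_t = c_alpha/(1+t)^sigma,
    beta_t = c_beta/(1+t)^nu, matching the reported constants
    (sigma = 3/2 * nu = 0.45)."""
    return c_alpha / (1 + t) ** sigma, c_beta / (1 + t) ** nu
```

With θ0 = w0 = 0 as in the paper, a constant-stepsize run would simply pass one of the reported pairs, e.g. `alpha=0.01, beta=0.006`, to `tdc_step` at every iteration.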