Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Authors: Tengyu Xu, Shaofeng Zou, Yingbin Liang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize. (Abstract) In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1. (Section 4, Experimental Results) |
| Researcher Affiliation | Academia | Tengyu Xu, Department of Electrical and Computer Engineering, The Ohio State University; Shaofeng Zou, Department of Electrical Engineering, University at Buffalo, The State University of New York; Yingbin Liang, Department of Electrical and Computer Engineering, The Ohio State University |
| Pseudocode | Yes | Algorithm 1 Blockwise Diminishing Stepsize TDC (Section 3.3) |
| Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include a link to a code repository. |
| Open Datasets | No | The paper uses 'Garnet problems [1]' which are described as a method for generating Markov decision processes. It specifies parameters for this generation (e.g., G(500, 20, 50, 20)), but does not refer to a publicly available or open dataset that can be accessed via a link, DOI, repository, or formal citation. |
| Dataset Splits | No | The paper conducts experiments in a simulated Markov decision process environment using generated Garnet problems, but it does not describe dataset splits (e.g., percentages or sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not specify any hardware components used for running the experiments (e.g., specific CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as programming languages or libraries with version numbers, used for the experiments. |
| Experiment Setup | Yes | For all experiments, we choose θ0 = w0 = 0. (Section 4, Experimental Results) For diminishing stepsize, we set cα = cβ and σ = 3/2ν, and tune their values to the best, which are given by cα = cβ = 1.8, σ = 3/2ν = 0.45. For the four constant-stepsize cases, we fix α for each case, and tune β to the best. The resulting parameter settings are respectively as follows: αt = 0.01, βt = 0.006; αt = 0.02, βt = 0.008; αt = 0.05, βt = 0.02; and αt = 0.1, βt = 0.02. (Section 4.2, Constant Stepsize vs Diminishing Stepsize) |
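The Experiment Setup row above can be read as two stepsize regimes. As a minimal sketch, the diminishing schedule is assumed here to follow the power-law form α_t = c_α/(1+t)^σ, β_t = c_β/(1+t)^ν, a standard choice in two-time-scale TD analyses; the paper excerpt only reports the tuned constants, so the functional form and the helper names below are assumptions, not the authors' exact code.

```python
# Hedged sketch of the stepsize settings quoted in the Experiment Setup row.
# The power-law decay form is an assumption; only the constants come from the paper.

C_ALPHA = C_BETA = 1.8   # tuned: c_alpha = c_beta = 1.8
SIGMA = 0.45             # decay exponent for the fast stepsize, sigma = (3/2) * nu
NU = SIGMA / 1.5         # decay exponent for the slow stepsize, nu = 0.3

def diminishing_stepsizes(t):
    """Return (alpha_t, beta_t) at iteration t under the assumed schedule."""
    alpha_t = C_ALPHA / (1 + t) ** SIGMA
    beta_t = C_BETA / (1 + t) ** NU
    return alpha_t, beta_t

# The four constant-stepsize settings reported in the table.
CONSTANT_SETTINGS = [
    (0.01, 0.006),
    (0.02, 0.008),
    (0.05, 0.02),
    (0.1, 0.02),
]
```

Under this reading, both stepsizes start at 1.8 and decay polynomially, with β_t decaying more slowly than α_t, which is consistent with the two-time-scale structure of TDC.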