Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Authors: Tengyu Xu, Shaofeng Zou, Yingbin Liang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize." (Abstract) "In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1." (Section 4, Experimental Results) |
| Researcher Affiliation | Academia | Tengyu Xu, Department of Electrical and Computer Engineering, The Ohio State University (xu.3260@osu.edu); Shaofeng Zou, Department of Electrical Engineering, University at Buffalo, The State University of New York (szou3@buffalo.edu); Yingbin Liang, Department of Electrical and Computer Engineering, The Ohio State University (liang.889@osu.edu) |
| Pseudocode | Yes | Algorithm 1, "Blockwise Diminishing Stepsize TDC" (Section 3.3); a hedged sketch of the algorithm's structure appears after this table. |
| Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include a link to a code repository. |
| Open Datasets | No | The paper uses "Garnet problems [1]", a standard procedure for randomly generating Markov decision processes. It specifies the generation parameters (e.g., G(500, 20, 50, 20)) but does not refer to a publicly available dataset accessible via a link, DOI, repository, or formal citation; a hedged generation sketch appears after this table. |
| Dataset Splits | No | The paper conducts experiments in a simulated Markov Decision Process environment using generated problems, but it does not describe specific dataset splits (e.g., percentages, sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not specify any hardware components used for running the experiments (e.g., specific CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as programming languages or libraries with version numbers, used for the experiments. |
| Experiment Setup | Yes | "For all experiments, we choose θ0 = w0 = 0." (Section 4, Experimental Results) "For diminishing stepsize, we set cα = cβ and σ = 3/2ν, and tune their values to the best, which are given by cα = cβ = 1.8, σ = 3/2ν = 0.45. For the four constant-stepsize cases, we fix α for each case, and tune β to the best. The resulting parameter settings are respectively as follows: αt = 0.01, βt = 0.006; αt = 0.02, βt = 0.008; αt = 0.05, βt = 0.02; and αt = 0.1, βt = 0.02." (Section 4.2, Constant Stepsize vs Diminishing Stepsize) |
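
The "Open Datasets" row notes that experiments run on randomly generated Garnet problems rather than a fixed dataset. Below is a minimal sketch of one common Garnet construction, assuming G(500, 20, 50, 20) denotes (number of states, number of actions, branching factor, feature dimension); the paper does not spell out its generator, so the reward distribution, feature matrix, and partition scheme here are all assumptions.

```python
import numpy as np

def make_garnet(n_states=500, n_actions=20, branching=50, n_features=20, seed=0):
    """Randomly generate a Garnet-style MDP with linear features.

    For each (state, action) pair, `branching` distinct next states are drawn
    uniformly, and their transition probabilities come from a random partition
    of [0, 1]. Rewards are uniform on [0, 1]; features are a random binary
    matrix. All of these distributional choices are assumptions.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            cuts = np.sort(rng.uniform(size=branching - 1))
            # Gaps between sorted cut points of [0, 1] sum to 1 by construction.
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.uniform(size=(n_states, n_actions))
    Phi = rng.binomial(1, 0.5, size=(n_states, n_features)).astype(float)
    return P, R, Phi
```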
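The "Pseudocode" and "Experiment Setup" rows reference two time-scale TDC with θ0 = w0 = 0 and the quoted stepsize constants. The following is a minimal sketch of one common form of the off-policy TDC update with linear features, plus the diminishing schedule. The placement of the importance weight ρ and the power-law schedule αt = cα/(1+t)^σ, βt = cβ/(1+t)^ν, with ν = 0.3 obtained by reading σ = (3/2)ν = 0.45, are assumptions rather than a reproduction of Algorithm 1.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One off-policy TDC update with linear features and importance weight rho."""
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    # Slow time-scale: TD update plus gradient-correction term.
    theta = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Fast time-scale: track the auxiliary least-squares solution.
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w

def diminishing_stepsizes(t, c_alpha=1.8, c_beta=1.8, sigma=0.45, nu=0.3):
    """alpha_t = c_alpha/(1+t)^sigma, beta_t = c_beta/(1+t)^nu.

    The power-law form and nu = 0.3 are assumptions about the quoted settings.
    """
    return c_alpha / (1.0 + t) ** sigma, c_beta / (1.0 + t) ** nu
```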
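Algorithm 1 itself is a blockwise diminishing stepsize scheme: stepsizes are held constant within a block and decreased across blocks. The driver below, built on `tdc_step` from the sketch above, only illustrates that structure; `sample_transition` is a hypothetical callable, and the geometric halving of stepsizes and doubling of block lengths are assumptions, not the paper's actual block rules.

```python
import numpy as np

def blockwise_tdc(sample_transition, dim, n_blocks=10, block_len=1000,
                  alpha0=0.1, beta0=0.02, gamma=0.95):
    """Illustrative blockwise-diminishing schedule: stepsizes are constant
    inside each block and halved between blocks, while block lengths double.

    `sample_transition` is a hypothetical callable returning one Markovian
    sample (phi, phi_next, reward, rho).
    """
    theta, w = np.zeros(dim), np.zeros(dim)  # theta0 = w0 = 0, as in the paper
    for s in range(n_blocks):
        alpha, beta = alpha0 / 2 ** s, beta0 / 2 ** s
        for _ in range(block_len * 2 ** s):
            phi, phi_next, reward, rho = sample_transition()
            theta, w = tdc_step(theta, w, phi, phi_next, reward, rho,
                                gamma, alpha, beta)
    return theta
```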