Non-Asymptotic Analysis for Two Time-scale TDC with General Smooth Function Approximation

Authors: Yue Wang, Shaofeng Zou, Yi Zhou

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we develop novel techniques to address the above challenges and explicitly characterize the non-asymptotic error bound for the general off-policy setting with i.i.d. or Markovian samples, and show that it converges as fast as O(1/√T) (up to a factor of O(log T)). Our approach can be applied to a wide range of value-based reinforcement learning algorithms with general smooth function approximation.
Researcher Affiliation | Academia | Yue Wang, Department of Electrical Engineering, University at Buffalo, Buffalo, NY, USA (ywang294@buffalo.edu); Shaofeng Zou, Department of Electrical Engineering, University at Buffalo, Buffalo, NY, USA (szou3@buffalo.edu); Yi Zhou, Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT, USA (yi.zhou@utah.edu)
Pseudocode | Yes | Algorithm 1: Non-Linear Off-Policy TDC under the Markovian Setting
Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and analyzes algorithms with samples generated from a Markov decision process; it does not use, or provide access information for, any publicly available dataset.
Dataset Splits | No | The paper is theoretical and does not conduct experiments on specific datasets, and therefore does not provide training/validation/test splits.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not specify any ancillary software or library versions used for experiments.
Experiment Setup | No | The paper is theoretical: it analyzes an algorithm but does not report hyperparameter values or system-level training settings, as it conducts no empirical experiments.
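Since no code is released, the two time-scale TDC update referenced in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation: it shows a single TDC step with linear features as the simplest smooth approximator V(s; θ) = θᵀφ(s), and the function name `tdc_step` and all variable names are hypothetical.

```python
import numpy as np

def tdc_step(theta, w, phi_s, phi_next, r, rho, gamma, alpha, beta):
    """One two time-scale TDC step (sketch, linear-feature special case).

    theta: value-function parameters (slow time-scale, step size alpha)
    w:     auxiliary correction weights (fast time-scale, step size beta)
    rho:   importance-sampling ratio for the off-policy sample
    """
    # TD error under the current value estimate V(s; theta) = theta @ phi(s)
    delta = r + gamma * theta @ phi_next - theta @ phi_s
    # Slow time-scale: semi-gradient TD update plus the gradient-correction
    # term; note it uses the OLD auxiliary weights w
    theta_new = theta + alpha * rho * (delta * phi_s
                                       - gamma * (phi_s @ w) * phi_next)
    # Fast time-scale: w tracks the projection of the TD error onto the
    # feature space (beta is typically larger than alpha)
    w_new = w + beta * rho * (delta - phi_s @ w) * phi_s
    return theta_new, w_new

# One off-policy step on a toy 2-state chain with one-hot features
theta, w = np.array([1.0, 0.0]), np.zeros(2)
phi_s, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta, w = tdc_step(theta, w, phi_s, phi_next, r=2.0, rho=1.0,
                    gamma=0.9, alpha=0.1, beta=0.5)
```

The two step sizes α and β are what make the algorithm "two time-scale": the analysis in the paper requires w to be updated faster than θ, which is why β is chosen larger than α in the toy call above.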