Neural Temporal-Difference Learning Converges to Global Optima

Authors: Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The paper proves for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error (MSPBE) for policy evaluation. In detail, randomly initialized neural TD converges to the global optimum of the MSPBE at a rate of 1/T with population semigradients and at a rate of 1/√T with stochastic semigradients (see the restated objective after the table).
Researcher Affiliation | Academia | Department of Industrial Engineering and Management Sciences, Northwestern University; Department of Operations Research and Financial Engineering, Princeton University; Department of Electrical Engineering, Princeton University
Pseudocode | Yes | Algorithm 1 (Neural TD) is given in the paper; an illustrative sketch follows the table.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper is theoretical and focuses on proving convergence rather than on empirical evaluation with specific datasets. It discusses theoretical concepts such as the stationary distribution of a policy but does not mention any publicly available datasets for training.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with datasets, so no training, validation, or test splits are mentioned.
Hardware Specification | No | The paper is purely theoretical and does not describe any experimental setup, so no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe any experimental setup, so it does not list software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup, including hyperparameters or system-level training settings.
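To make the Research Type row concrete, the LaTeX below restates the objective and rates it refers to. This is a standard-notation sketch, not a quotation from the paper: V_theta denotes the neural value-function approximator, T^pi the Bellman operator of the evaluated policy, Pi_F the projection onto the approximating function class, and mu the stationary distribution; these symbol choices are assumptions about the paper's exact notation.

% Mean-squared projected Bellman error (MSPBE) for policy evaluation,
% written in standard notation; the paper's exact symbols may differ.
\mathrm{MSPBE}(\theta)
  = \mathbb{E}_{s \sim \mu}\left[ \bigl( V_\theta(s) - \Pi_{\mathcal{F}} \mathcal{T}^{\pi} V_\theta(s) \bigr)^2 \right]

% Convergence rates reported in the Research Type row:
%   population semigradients:  O(1/T)
%   stochastic semigradients:  O(1/\sqrt{T})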
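Since the paper provides Algorithm 1 (Neural TD) only as pseudocode, the following is a minimal, hypothetical Python sketch of neural TD(0) with stochastic semigradients on a two-layer ReLU network with a fixed output layer and random initialization. It only illustrates the semigradient step, in which the bootstrapped target is treated as a constant; the function and variable names, the synthetic transitions, and omitted details such as any projection step are assumptions, not the authors' implementation.

import numpy as np

# Minimal sketch (not the paper's exact Algorithm 1): TD(0) with a two-layer
# ReLU network V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r . s), where only the
# input-layer weights W are trained and the output weights b are fixed.

rng = np.random.default_rng(0)

d, m = 4, 64                      # state dimension, network width
W = rng.normal(size=(m, d))       # randomly initialized first-layer weights
b = rng.choice([-1.0, 1.0], m)    # fixed output weights
gamma, lr = 0.9, 0.1              # discount factor and step size


def value(s, W):
    """Two-layer ReLU network value estimate V_W(s)."""
    pre = W @ s
    return (b * np.maximum(pre, 0.0)).sum() / np.sqrt(m)


def td_update(s, r, s_next, W):
    """One stochastic semigradient TD(0) step on a transition (s, r, s_next)."""
    # Bootstrapped target; treated as a constant (semigradient), so no
    # gradient flows through value(s_next, W).
    target = r + gamma * value(s_next, W)
    delta = value(s, W) - target                  # TD error
    active = (W @ s > 0.0).astype(float)          # ReLU activation pattern
    # Gradient of value(s, W) w.r.t. W: (1/sqrt(m)) * b_r * 1{w_r.s > 0} * s
    grad_V = (b * active)[:, None] * s[None, :] / np.sqrt(m)
    return W - lr * delta * grad_V


# Usage on synthetic transitions standing in for samples from the stationary
# distribution of the evaluated policy (purely illustrative data).
for _ in range(1000):
    s = rng.normal(size=d); s /= np.linalg.norm(s)
    s_next = rng.normal(size=d); s_next /= np.linalg.norm(s_next)
    r = float(s[0])                               # hypothetical reward signal
    W = td_update(s, r, s_next, W)

print("example value estimate:", value(np.ones(d) / np.sqrt(d), W))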