Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

Authors: Yufeng Zhang, Qi Cai, Zhuoran Yang, Yongxin Chen, Zhaoran Wang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "We prove that, utilizing an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate. Moreover, the associated feature representation converges to the optimal one, generalizing the previous analysis of [21] in the neural tangent kernel regime, where the associated feature representation stabilizes at the initial one. The key to our analysis is a mean-field perspective." (A sketch of this setting appears after the table.)
Researcher Affiliation | Academia | Yufeng Zhang (Northwestern University, Evanston, IL 60208; yufengzhang2023@u.northwestern.edu); Qi Cai (Northwestern University, Evanston, IL 60208; qicai2022@u.northwestern.edu); Zhuoran Yang (Princeton University, Princeton, NJ 08544; zy6@princeton.edu); Yongxin Chen (Georgia Institute of Technology, Atlanta, GA 30332; yongchen@gatech.edu); Zhaoran Wang (Northwestern University, Evanston, IL 60208; zhaoranwang@gmail.com)
Pseudocode | Yes | "For an initial distribution $\rho_0 \in \mathcal{P}(\mathbb{R}^D)$, we initialize $\{\theta_i\}_{i \in [m]}$ as i.i.d. samples from $\rho_0$. See Algorithm 1 in Appendix A for a detailed description." (A sketch of this initialization step appears after the table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and does not conduct experiments or use datasets, thus no information about public dataset availability is provided.
Dataset Splits | No | The paper is theoretical and does not conduct experiments with datasets, thus no information about training/validation/test splits is provided.
Hardware Specification | No | The paper is theoretical and does not describe any experiments that would require hardware specifications.
Software Dependencies | No | The paper is theoretical and does not describe an experimental setup that would involve software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations.
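To ground the Research Type row, here is a minimal, hypothetical sketch of the setting the quoted abstract describes: TD(0) with an overparameterized two-layer network in the mean-field parameterization. It is not the authors' code; the tanh activation, the random transition stream standing in for the sampling distribution, and every dimension and hyperparameter below are illustrative assumptions, and the 1/m mean-field factor is absorbed into the per-particle step size.

```python
# A minimal sketch (not the authors' implementation) of mean-field TD(0) with an
# overparameterized two-layer network Q(x) = (alpha/m) * sum_i sigma(theta_i . x).
# All dimensions, hyperparameters, and the data stream are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

D = 8        # dimension of the (state, action) feature vector (assumed)
m = 4096     # network width; the mean-field analysis studies m -> infinity
gamma = 0.9  # discount factor (assumed)
eta = 0.1    # step size (assumed)
alpha = 1.0  # output scaling

# Initialize the m particles i.i.d. from rho_0 = N(0, I_D) (an assumed choice of rho_0).
theta = rng.standard_normal((m, D))

def q_value(theta, x):
    """Two-layer network value: Q(x) = (alpha/m) * sum_i tanh(theta_i . x)."""
    return alpha / m * np.tanh(theta @ x).sum()

def td_step(theta, x, r, x_next):
    """One TD(0) semi-gradient step on every particle.

    The Bellman residual delta treats the bootstrapped target as fixed; the 1/m
    factor of dQ/dtheta_i is absorbed into the step size, which is the usual
    mean-field (as opposed to NTK) scaling.
    """
    delta = q_value(theta, x) - (r + gamma * q_value(theta, x_next))
    grad = alpha * (1.0 - np.tanh(theta @ x) ** 2)[:, None] * x[None, :]
    return theta - eta * delta * grad

# Hypothetical transition stream standing in for the sampling distribution.
for _ in range(1000):
    x = rng.standard_normal(D); x /= np.linalg.norm(x)
    x_next = rng.standard_normal(D); x_next /= np.linalg.norm(x_next)
    r = float(x[0])  # placeholder reward
    theta = td_step(theta, x, r, x_next)
```

Because the particles themselves move, the empirical distribution of $\{\theta_i\}$, and hence the learned feature representation, can drift away from its initialization; that is the representation-learning behavior the abstract contrasts with the NTK regime, where the representation stabilizes at the initial one.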
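The Pseudocode row quotes only the initialization step of Algorithm 1. A minimal sketch of that step under an assumed choice of $\rho_0$ (the quoted statement only requires $\rho_0 \in \mathcal{P}(\mathbb{R}^D)$):

```python
# A sketch of the quoted initialization: draw m particles i.i.d. from rho_0.
# The Gaussian rho_0 below is an assumption; any distribution on R^D works.
import numpy as np

def init_particles(m, sample_rho0, rng):
    """Stack m i.i.d. draws theta_i ~ rho_0 into an (m, D) array."""
    return np.stack([sample_rho0(rng) for _ in range(m)])

rng = np.random.default_rng(0)
D, m = 8, 4096
sample_rho0 = lambda rng: rng.standard_normal(D)  # rho_0 = N(0, I_D), assumed
theta0 = init_particles(m, sample_rho0, rng)
print(theta0.shape)  # (4096, 8)
```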