Neural Policy Gradient Methods: Global Optimality and Rates of Convergence

Authors: Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | However, it remains less clear whether such neural policy gradient methods converge to globally optimal policies and whether they even converge at all. We answer both questions affirmatively under the overparameterized two-layer neural-network parameterization. In detail, assuming independent sampling, we prove that neural natural policy gradient converges to a globally optimal policy at a sublinear rate. Also, we show that neural vanilla policy gradient converges sublinearly to a stationary point. Meanwhile, by relating the suboptimality of the stationary points to the representation power of the neural actor and critic classes, we prove the global optimality of all stationary points under mild regularity conditions. In particular, we show that a key to the global optimality and convergence is the compatibility between the actor and critic, which is ensured by sharing neural architectures and random initializations across the actor and critic. To the best of our knowledge, our analysis establishes the first global optimality and convergence guarantees for neural policy gradient methods. (See the critic and actor sketches after this table.)
Researcher Affiliation | Academia | Northwestern University (lingxiaowang2022@u.northwestern.edu); Northwestern University (qicai2022@u.northwestern.edu); Princeton University (zy6@princeton.edu); Northwestern University (zhaoranwang@gmail.com)
Pseudocode | Yes | Algorithm 1 (Neural Policy Gradient Methods); Algorithm 2 (Neural TD, Cai et al., 2019). Hedged sketches of both algorithms appear after this table.
Open Source Code | No | The paper does not provide any concrete access to source code (e.g., a specific repository link or an explicit statement of code release) for the methodology described in this paper.
Open Datasets | No | The paper is theoretical and uses abstract mathematical models (Markov decision processes, state spaces, etc.) rather than specific empirical datasets for training or evaluation. Therefore, it does not provide access information for a public dataset.
Dataset Splits | No | The paper is theoretical and does not conduct experiments with empirical data. Therefore, it does not provide specific dataset split information (percentages, counts, or predefined splits) for reproduction.
Hardware Specification | No | The paper is theoretical and focuses on mathematical analysis. It does not describe any experiments that would require specific hardware, thus no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and focuses on mathematical analysis of algorithms. It does not describe any software implementations or list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and presents algorithms and their convergence properties. While it discusses parameters like learning rate and batch size within the theoretical framework, it does not provide specific hyperparameter values or system-level training settings for an empirical experiment setup.
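
To make the critic step concrete, here is a minimal sketch of a two-layer, overparameterized critic trained with projected TD updates, in the spirit of Algorithm 2 (Neural TD, Cai et al., 2019). The paper ships no code, so every name, dimension, step size, and the projection radius below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def init_two_layer(dim, width, rng):
    """Shared random initialization for actor and critic (assumed scheme:
    Gaussian first-layer weights, fixed +/-1 output weights; only the
    first layer is trained, as in the overparameterized two-layer setup)."""
    W = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(width, dim))
    b = rng.choice(np.array([-1.0, 1.0]), size=width)  # kept fixed during training
    return W, b

def two_layer(x, W, b):
    """f_W(x) = (1 / sqrt(m)) * sum_r b_r * ReLU(w_r^T x)."""
    m = W.shape[0]
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(m)

def grad_two_layer(x, W, b):
    """Gradient of f_W(x) with respect to the first-layer weights W."""
    m = W.shape[0]
    active = (W @ x > 0.0).astype(float)              # ReLU activation pattern, shape (m,)
    return (b * active)[:, None] * x[None, :] / np.sqrt(m)

def neural_td_step(W, W0, b, sa, reward, sa_next, gamma, lr, radius):
    """One semi-gradient TD(0) update on the critic weights, projected back
    onto a Frobenius ball around the initialization W0 (the radius is an
    assumption standing in for the paper's constraint set)."""
    td_error = two_layer(sa, W, b) - (reward + gamma * two_layer(sa_next, W, b))
    W = W - lr * td_error * grad_two_layer(sa, W, b)
    diff = W - W0
    norm = np.linalg.norm(diff)
    if norm > radius:                                  # projection step
        W = W0 + diff * (radius / norm)
    return W
```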
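And here is a similarly hedged sketch of one actor iteration in the spirit of Algorithm 1, reusing the two-layer helpers (and the numpy import) from the critic sketch above. The actor shares the same architecture and random initialization as the critic, which is the actor-critic compatibility the abstract emphasizes; for simplicity this sketch performs an assumed vanilla policy-gradient step over a finite action set with an energy-based (softmax) policy, not the paper's exact natural-gradient update.

```python
def policy_probs(state, action_set, theta, b, tau):
    """Energy-based policy pi_theta(a|s) proportional to exp(f_theta(s, a) / tau)
    over a finite action set (finite actions and the temperature tau are assumptions)."""
    logits = np.array([two_layer(np.concatenate([state, a]), theta, b) / tau
                       for a in action_set])
    logits -= logits.max()                      # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def policy_gradient_step(theta, b, W_critic, states, actions_taken,
                         action_set, tau, lr):
    """One (assumed) vanilla policy-gradient update that scores each sampled
    state-action pair with the TD critic's Q-value estimate."""
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions_taken):
        sa = np.concatenate([s, a])
        q_hat = two_layer(sa, W_critic, b)      # critic's estimate of Q(s, a)
        probs = policy_probs(s, action_set, theta, b, tau)
        # score function: grad log pi(a|s) = (grad f(s,a) - E_pi[grad f(s,.)]) / tau
        score = grad_two_layer(sa, theta, b) / tau
        for p_a, a_other in zip(probs, action_set):
            score -= p_a * grad_two_layer(np.concatenate([s, a_other]), theta, b) / tau
        grad += q_hat * score
    return theta + lr * grad / len(states)
```

The only link between the two sketches is that init_two_layer is called once and its output seeds both the actor's theta and the critic's W0, mirroring the shared-initialization compatibility condition highlighted in the abstract.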