Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy
Authors: Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate. By answering questions (i)-(iii), we establish the first nonasymptotic global rate of convergence of a variant of PPO (and TRPO) equipped with neural networks. In detail, we prove that, with the policy and action-value function parametrized by randomly initialized and overparametrized two-layer neural networks, PPO converges to the optimal policy at the rate of O(1/√K), where K is the number of iterations. (The KL-regularized update behind this rate is sketched below the table.) |
| Researcher Affiliation | Academia | Boyi Liu (Northwestern University; boyiliu2018@u.northwestern.edu), Qi Cai (Northwestern University; qicai2022@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com). The paper carries an equal-contribution footnote. |
| Pseudocode | Yes | The paper provides pseudocode as Algorithm 1 (Neural PPO); a hypothetical code sketch of one such iteration appears below the table. |
| Open Source Code | No | The paper does not include any statement or link indicating the release of source code for the described methodology. |
| Open Datasets | No | This paper is theoretical and does not describe experiments involving datasets for training. |
| Dataset Splits | No | This paper is theoretical and does not describe experimental validation or dataset splitting. |
| Hardware Specification | No | The paper focuses on theoretical proofs and does not mention any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe any specific software dependencies with version numbers for replication. |
| Experiment Setup | No | The paper focuses on theoretical analysis and does not provide details on experimental setup such as hyperparameters or training configurations. |
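For reference, the result quoted in the Research Type row centers on the following KL-regularized policy improvement step, which the paper analyzes as infinite-dimensional mirror descent. The notation below (action-value function $Q^{\pi_k}$, penalty parameter $\beta_k$, optimal policy $\pi^*$) follows standard PPO/TRPO conventions and is a paraphrase of the setup, not a verbatim excerpt from the paper:

```latex
% One mirror-descent iteration: pi_{k+1} maximizes the expected
% action value minus a KL penalty keeping it close to pi_k.
\[
\pi_{k+1} = \operatorname*{argmax}_{\pi}\;
  \mathbb{E}\bigl[\langle Q^{\pi_k}(s,\cdot),\, \pi(\cdot \mid s)\rangle
  - \beta_k\, \mathrm{KL}\bigl(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\bigr)\bigr],
\]
% whose closed-form solution is an exponentiated (multiplicative) update,
\[
\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s)\,
  \exp\bigl(Q^{\pi_k}(s,\cdot)/\beta_k\bigr),
\]
% and the quoted global rate states that the best iterate closes the
% optimality gap sublinearly in the iteration count K:
\[
\min_{0 \le k \le K} \bigl(J(\pi^*) - J(\pi_k)\bigr) = O(1/\sqrt{K}).
\]
```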
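Since the Pseudocode row only names Algorithm 1 (Neural PPO) without reproducing it, here is a minimal, hypothetical PyTorch sketch of one such iteration: an energy-based policy over a randomly initialized, overparametrized two-layer network, a TD-style policy-evaluation step, and the exponentiated improvement step above realized as a regression of the new energy onto τ′(Q/β + f/τ). All names (`TwoLayerNet`, `neural_ppo_step`, the batch layout) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Randomly initialized, overparametrized two-layer ReLU network
    f(x) = m^{-1/2} * sum_r b_r * relu(w_r . x). The second-layer signs
    b_r are fixed at +-1 and only the first layer is trained, mirroring
    the parametrization the paper analyzes (a sketch, not its code)."""
    def __init__(self, in_dim: int, width: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(width, in_dim))
        self.register_buffer("b", torch.sign(torch.randn(width)))
        self.scale = width ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (torch.relu(x @ self.W.t()) * self.b).sum(-1)

def neural_ppo_step(f_old, q_net, f_new, batch, tau, tau_next, beta,
                    gamma=0.99, steps=100, lr=1e-3):
    """One hypothetical Neural PPO iteration on a batch of transitions.

    batch: dict with 's_a' (state-action features), 'r' (rewards),
    's_a_next' (next state-action features). The policy is energy-based:
    pi(a|s) proportional to exp(f(s,a) / tau).
    """
    # (i) Policy evaluation: fit q_net to a semi-gradient TD target
    # approximating Q^{pi_k}.
    opt_q = torch.optim.SGD(q_net.parameters(), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            target = batch["r"] + gamma * q_net(batch["s_a_next"])
        loss_q = ((q_net(batch["s_a"]) - target) ** 2).mean()
        opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # (ii) Policy improvement: the KL-penalized update has the closed
    # form pi' ∝ pi * exp(Q / beta); in energy space that reads
    # f'/tau' = f/tau + Q/beta, so regress f_new onto the target energy.
    opt_f = torch.optim.SGD(f_new.parameters(), lr=lr)
    with torch.no_grad():
        energy_target = tau_next * (q_net(batch["s_a"]) / beta
                                    + f_old(batch["s_a"]) / tau)
    for _ in range(steps):
        loss_f = ((f_new(batch["s_a"]) - energy_target) ** 2).mean()
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    return f_new

if __name__ == "__main__":
    # Toy usage with random features standing in for (s, a) encodings.
    dim, width, n = 8, 256, 512
    f_old, f_new, q_net = (TwoLayerNet(dim, width) for _ in range(3))
    batch = {"s_a": torch.randn(n, dim), "r": torch.randn(n),
             "s_a_next": torch.randn(n, dim)}
    neural_ppo_step(f_old, q_net, f_new, batch,
                    tau=1.0, tau_next=0.5, beta=1.0)
```

The two-phase structure (evaluation, then improvement-as-regression) is what the paper's analysis tracks: each phase is a finite-dimensional approximation of one infinite-dimensional mirror-descent step, and the overparametrized width keeps the approximation error small enough for the O(1/√K) rate.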