Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy

Authors: Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate." and "By answering questions (i)-(iii), we establish the first nonasymptotic global rate of convergence of a variant of PPO (and TRPO) equipped with neural networks. In detail, we prove that, with policy and action-value function parametrized by randomly initialized and overparametrized two-layer neural networks, PPO converges to the optimal policy at the rate of O(1/√K), where K is the number of iterations." (A hedged sketch of this two-layer parametrization appears after the table.)
Researcher Affiliation | Academia | Northwestern University (boyiliu2018@u.northwestern.edu); Northwestern University (qicai2022@u.northwestern.edu); Princeton University (zy6@princeton.edu); Northwestern University (zhaoranwang@gmail.com); equal contribution noted.
Pseudocode | Yes | Algorithm 1 (Neural PPO). (A hedged sketch of the update it approximates appears after the table.)
Open Source Code | No | The paper does not include any statement or link indicating the release of source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not describe experiments involving datasets for training.
Dataset Splits | No | The paper is theoretical and does not describe experimental validation or dataset splitting.
Hardware Specification | No | The paper focuses on theoretical proofs and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any software dependencies with version numbers needed for replication.
Experiment Setup | No | The paper focuses on theoretical analysis and does not provide experimental details such as hyperparameters or training configurations.
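For concreteness, here is a minimal NumPy sketch of the randomly initialized, overparametrized two-layer ReLU parametrization referenced in the Research Type row: f(x; W) = (1/√m) Σ_r b_r · ReLU(w_rᵀx), with the output signs b_r fixed at initialization and only the first-layer weights W trained. The class name, width, and initialization scale below are illustrative assumptions, not the authors' code.

```python
import numpy as np

class TwoLayerNet:
    """Sketch of f(x; W) = (1/sqrt(m)) * sum_r b_r * relu(w_r^T x)."""

    def __init__(self, dim, width, seed=0):
        rng = np.random.default_rng(seed)
        self.m = width
        # Random initialization; only W is updated during training.
        self.W = rng.normal(size=(width, dim)) / np.sqrt(dim)
        # Output-layer signs b_r in {-1, +1}, fixed after initialization.
        self.b = rng.choice([-1.0, 1.0], size=width)

    def value(self, x):
        pre = self.W @ x                       # (m,) pre-activations
        return float(self.b @ np.maximum(pre, 0.0)) / np.sqrt(self.m)

    def grad_W(self, x):
        # d f / d w_r = b_r * 1{w_r^T x > 0} * x / sqrt(m)  (ReLU indicator)
        active = (self.W @ x > 0.0).astype(float)
        return ((self.b * active)[:, None] * x[None, :]) / np.sqrt(self.m)

# Example: net = TwoLayerNet(dim=4, width=256); net.value(np.ones(4))
```

In the large-width regime the paper analyzes, such a network stays close to its random initialization during training, which underlies its ability to approximate the infinite-dimensional gradient and iterate of mirror descent.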
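The Pseudocode row refers to Algorithm 1 (Neural PPO), which alternates policy evaluation with a KL-regularized policy improvement step. The toy sketch below implements the ideal mirror-descent update that the algorithm approximates: here the energy f and action value Q are exact tables on an invented tiny MDP, whereas in the paper both are two-layer networks fitted by neural TD and least squares, and the constant penalty beta stands in for the paper's step-size schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9                        # invented tiny MDP
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] = next-state dist.
R = rng.uniform(size=(nS, nA))                   # rewards

def q_of_policy(pi):
    # Exact policy evaluation (stand-in for the paper's neural TD step).
    Q = np.zeros((nS, nA))
    for _ in range(500):
        V = (pi * Q).sum(axis=1)                 # V(s) = <pi(s, .), Q(s, .)>
        Q = R + gamma * P @ V                    # Bellman expectation update
    return Q

def policy_of(f):
    # Energy-based policy: pi(a|s) proportional to exp(f(s, a)).
    p = np.exp(f - f.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

f = np.zeros((nS, nA))                           # energy table
K, beta = 50, 1.0                                # iterations, KL penalty
for _ in range(K):
    pi = policy_of(f)
    Q = q_of_policy(pi)                          # step (i): policy evaluation
    f = f + Q / beta                             # step (ii): ideal KL update,
                                                 # i.e. pi' ∝ pi * exp(Q / beta)
pi = policy_of(f)
print((pi * q_of_policy(pi)).sum(axis=1))        # per-state value of final policy
```

In Neural PPO, step (ii) is carried out only approximately, roughly by regressing a fresh two-layer network onto the target energy over sampled state-action pairs; the O(1/√K) rate quantifies how these approximation errors accumulate over iterations.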