Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy
Authors: Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate. By answering questions (i)-(iii), we establish the first nonasymptotic global rate of convergence of a variant of PPO (and TRPO) equipped with neural networks. In detail, we prove that, with the policy and action-value function parametrized by randomly initialized and overparametrized two-layer neural networks, PPO converges to the optimal policy at the rate of O(1/√K), where K is the number of iterations. (The KL-regularized update behind this rate is sketched below the table.) |
| Researcher Affiliation | Academia | Boyi Liu (Northwestern University; boyiliu2018@u.northwestern.edu), Qi Cai (Northwestern University; qicai2022@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com). The paper carries an equal-contribution footnote. |
| Pseudocode | Yes | The paper provides pseudocode as Algorithm 1 (Neural PPO); a hypothetical code sketch of one such iteration appears below the table. |
| Open Source Code | No | The paper does not include any statement or link indicating the release of source code for the described methodology. |
| Open Datasets | No | This paper is theoretical and does not describe experiments involving datasets for training. |
| Dataset Splits | No | This paper is theoretical and does not describe experimental validation or dataset splitting. |
| Hardware Specification | No | The paper focuses on theoretical proofs and does not mention any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe any specific software dependencies with version numbers for replication. |
| Experiment Setup | No | The paper focuses on theoretical analysis and does not provide details on experimental setup such as hyperparameters or training configurations. |
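For reference, the result quoted in the Research Type row centers on the following KL-regularized policy improvement step, which the paper analyzes as infinite-dimensional mirror descent. The notation below (action-value function $Q^{\pi_k}$, penalty parameter $\beta_k$, optimal policy $\pi^*$) follows standard PPO/TRPO conventions and is a paraphrase of the setup, not a verbatim excerpt from the paper:

```latex
% One mirror-descent iteration: pi_{k+1} maximizes the expected
% action value minus a KL penalty keeping it close to pi_k.
\[
\pi_{k+1} = \operatorname*{argmax}_{\pi}\;
  \mathbb{E}\bigl[\langle Q^{\pi_k}(s,\cdot),\, \pi(\cdot \mid s)\rangle
  - \beta_k\, \mathrm{KL}\bigl(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\bigr)\bigr],
\]
% whose closed-form solution is an exponentiated (multiplicative) update,
\[
\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s)\,
  \exp\bigl(Q^{\pi_k}(s,\cdot)/\beta_k\bigr),
\]
% and the quoted global rate states that the best iterate closes the
% optimality gap sublinearly in the iteration count K:
\[
\min_{0 \le k \le K} \bigl(J(\pi^*) - J(\pi_k)\bigr) = O(1/\sqrt{K}).
\]
```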
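Since the Pseudocode row only names Algorithm 1 (Neural PPO) without reproducing it, here is a minimal, hypothetical PyTorch sketch of one such iteration: an energy-based policy over a randomly initialized, overparametrized two-layer network, a TD-style policy-evaluation step, and the exponentiated improvement step above realized as a regression of the new energy onto τ′(Q/β + f/τ). All names (`TwoLayerNet`, `neural_ppo_step`, the batch layout) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Randomly initialized, overparametrized two-layer ReLU network
    f(x) = m^{-1/2} * sum_r b_r * relu(w_r . x). The second-layer signs
    b_r are fixed at +-1 and only the first layer is trained, mirroring
    the parametrization the paper analyzes (a sketch, not its code)."""
    def __init__(self, in_dim: int, width: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(width, in_dim))
        self.register_buffer("b", torch.sign(torch.randn(width)))
        self.scale = width ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (torch.relu(x @ self.W.t()) * self.b).sum(-1)

def neural_ppo_step(f_old, q_net, f_new, batch, tau, tau_next, beta,
                    gamma=0.99, steps=100, lr=1e-3):
    """One hypothetical Neural PPO iteration on a batch of transitions.

    batch: dict with 's_a' (state-action features), 'r' (rewards),
    's_a_next' (next state-action features). The policy is energy-based:
    pi(a|s) proportional to exp(f(s,a) / tau).
    """
    # (i) Policy evaluation: fit q_net to a semi-gradient TD target
    # approximating Q^{pi_k}.
    opt_q = torch.optim.SGD(q_net.parameters(), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            target = batch["r"] + gamma * q_net(batch["s_a_next"])
        loss_q = ((q_net(batch["s_a"]) - target) ** 2).mean()
        opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # (ii) Policy improvement: the KL-penalized update has the closed
    # form pi' ∝ pi * exp(Q / beta); in energy space that reads
    # f'/tau' = f/tau + Q/beta, so regress f_new onto the target energy.
    opt_f = torch.optim.SGD(f_new.parameters(), lr=lr)
    with torch.no_grad():
        energy_target = tau_next * (q_net(batch["s_a"]) / beta
                                    + f_old(batch["s_a"]) / tau)
    for _ in range(steps):
        loss_f = ((f_new(batch["s_a"]) - energy_target) ** 2).mean()
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    return f_new

if __name__ == "__main__":
    # Toy usage with random features standing in for (s, a) encodings.
    dim, width, n = 8, 256, 512
    f_old, f_new, q_net = (TwoLayerNet(dim, width) for _ in range(3))
    batch = {"s_a": torch.randn(n, dim), "r": torch.randn(n),
             "s_a_next": torch.randn(n, dim)}
    neural_ppo_step(f_old, q_net, f_new, batch,
                    tau=1.0, tau_next=0.5, beta=1.0)
```

The two-phase structure (evaluation, then improvement-as-regression) is what the paper's analysis tracks: each phase is a finite-dimensional approximation of one infinite-dimensional mirror-descent step, and the overparametrized width keeps the approximation error small enough for the O(1/√K) rate.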