Sample Complexity of Policy Gradient Finding Second-Order Stationary Points
Authors: Long Yang, Qian Zheng, Gang Pan10630-10638
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our result shows that policy gradient converges to an (ϵ, ϵχ)SOSP with probability at least 1 e O(δ) after the total cost 2 (1 γ) χ log 1 2 ), where γ (0, 1). It sig- nificantly improves the state of the art cost e O(ϵ 9).Our analysis is based on the key idea that decomposes the parameter space Rp into three non-intersected regions: non-stationary point region, saddle point region, and local optimal region, then making a local improvement of the objective of RL in each region. This technique can be potentially generalized to extensive policy gradient methods. For the complete proof, please refer to https://arxiv.org/pdf/2012.01491.pdf. |
| Researcher Affiliation | Academia | 1College of Computer Science and Technology, Zhejiang University, China. 2School of Electrical and Electronic Engineering, Nanyang Technological University,Singapore. 1 {yanglong,gpan}@zju.edu.cn; 2zhengqian@ntu.edu.sg |
| Pseudocode | No | The paper describes update rules like (5) (θk+1 = θk + α \ J(θk)), but these are equations within the text and not presented as formal pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide a statement or a link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and does not describe the use of any datasets for training experiments. Example 1 describes a conceptual MDP, not a dataset used for empirical training. |
| Dataset Splits | No | The paper is theoretical and does not mention any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not mention any hardware specifications used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on complexity analysis rather than detailing an experimental setup, hyperparameters, or training configurations. |