Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

Authors: Long Yang, Qian Zheng, Gang Pan10630-10638

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical Our result shows that policy gradient converges to an (ϵ, ϵχ)SOSP with probability at least 1 e O(δ) after the total cost 2 (1 γ) χ log 1 2 ), where γ (0, 1). It sig- nificantly improves the state of the art cost e O(ϵ 9).Our analysis is based on the key idea that decomposes the parameter space Rp into three non-intersected regions: non-stationary point region, saddle point region, and local optimal region, then making a local improvement of the objective of RL in each region. This technique can be potentially generalized to extensive policy gradient methods. For the complete proof, please refer to https://arxiv.org/pdf/2012.01491.pdf.
Researcher Affiliation Academia 1College of Computer Science and Technology, Zhejiang University, China. 2School of Electrical and Electronic Engineering, Nanyang Technological University,Singapore. 1 {yanglong,gpan}@zju.edu.cn; 2zhengqian@ntu.edu.sg
Pseudocode No The paper describes update rules like (5) (θk+1 = θk + α \ J(θk)), but these are equations within the text and not presented as formal pseudocode blocks or algorithms.
Open Source Code No The paper does not provide a statement or a link indicating that source code for the described methodology is publicly available.
Open Datasets No The paper is theoretical and does not describe the use of any datasets for training experiments. Example 1 describes a conceptual MDP, not a dataset used for empirical training.
Dataset Splits No The paper is theoretical and does not mention any training, validation, or test dataset splits.
Hardware Specification No The paper is theoretical and does not mention any hardware specifications used for experiments.
Software Dependencies No The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup No The paper is theoretical and focuses on complexity analysis rather than detailing an experimental setup, hyperparameters, or training configurations.