On the Global Convergence Rates of Softmax Policy Gradient Methods

Authors: Jincheng Mei, Chenjun Xiao, Csaba Szepesvári, Dale Schuurmans

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an O(1/t) rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate O(e^{−c·t}), for some constant c > 0, toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new Ω(1/t) lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree.
Researcher Affiliation | Collaboration | Jincheng Mei 1,2,*, Chenjun Xiao 1, Csaba Szepesvári 3,1, Dale Schuurmans 2,1. * Work done as an intern at Google Brain. 1 University of Alberta; 2 Google Research, Brain Team; 3 DeepMind.
Pseudocode | Yes | Algorithm 1 (Policy Gradient Method). Input: learning rate η > 0. Initialize logits θ_1(s, a) for all (s, a). For t = 1 to T: θ_{t+1} ← θ_t + η · ∂V^{π_{θ_t}}(µ)/∂θ_t. End for. (A runnable sketch of this update in the bandit special case is given after the table.)
Open Source Code | No | The paper does not provide any statements about code release or links to source code.
Open Datasets | No | The paper is theoretical and focuses on convergence rates in tabular MDPs and bandit problems. It does not mention using or providing access to any real-world or benchmark datasets for training.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, so no training/validation/test splits are discussed.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware (e.g., CPU or GPU models) used for computations or simulations.
Software Dependencies | No | The paper does not mention any specific software dependencies, libraries, or version numbers.
Experiment Setup | Yes | Algorithm 1 specifies 'Input: Learning rate η > 0', and the paper fixes concrete values: η = 2/5 in Theorems 2 and 3, η = (1−γ)³/8 in Theorem 4, and η = (1−γ)³/(8 + τ(4 + 8 log A)) in Theorem 6. These are the hyperparameter (step-size) settings for the algorithms being analyzed; the first sketch below reuses η = 2/5.
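Below is a minimal, hypothetical Python sketch of Algorithm 1 in the single-state (bandit) special case that the paper also analyzes. The closed-form softmax gradient ∂⟨π_θ, r⟩/∂θ_a = π_θ(a)·(r(a) − ⟨π_θ, r⟩) stands in for ∂V^{π_{θ_t}}(µ)/∂θ_t; the reward vector and iteration count are illustrative choices, while η = 2/5 matches the value used in Theorem 2.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative 3-armed bandit with a unique optimal action (arm 0)
r = np.array([1.0, 0.8, 0.1])
theta = np.zeros(3)                  # uniform initial policy
eta = 2.0 / 5.0                      # step size from Theorem 2

for t in range(5000):
    pi = softmax(theta)
    # Exact softmax policy gradient in the bandit case:
    # d<pi, r>/dtheta_a = pi(a) * (r(a) - <pi, r>)
    grad = pi * (r - pi @ r)
    theta += eta * grad              # the Algorithm 1 ascent update

pi = softmax(theta)
print("final policy:", pi)                    # concentrates on arm 0
print("sub-optimality:", r.max() - pi @ r)    # decays like O(1/t) per Theorem 2
```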
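A companion sketch of the entropy-regularized update in the same bandit setting, again hypothetical: differentiating ⟨π_θ, r⟩ + τ·H(π_θ) through the softmax gives the gradient π_θ ⊙ (c − ⟨π_θ, c⟩) with c = r − τ log π_θ, whose unique full-support fixed point is the softmax optimal policy softmax(r/τ). The temperature, step size, and iteration count here are illustrative and are not values taken from the paper.

```python
import numpy as np

def softmax(v):
    z = v - v.max()                  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

r = np.array([1.0, 0.8, 0.1])        # same illustrative 3-armed bandit
tau = 0.2                            # illustrative temperature
eta = 0.4                            # illustrative step size
theta = np.zeros(3)

pi_star = softmax(r / tau)           # softmax optimal policy for this objective
for t in range(200):
    pi = softmax(theta)
    c = r - tau * np.log(pi)         # regularized "reward": r - tau * log pi
    theta += eta * pi * (c - pi @ c) # exact gradient of <pi, r> + tau * H(pi)

pi = softmax(theta)
# The distance to pi_star shrinks at a linear (geometric) rate in t,
# in contrast to the O(1/t) decay of the unregularized sketch above.
print("||pi - pi_star||_1 =", np.abs(pi - pi_star).sum())
```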