On the Global Convergence Rates of Softmax Policy Gradient Methods

Authors: Jincheng Mei, Chenjun Xiao, Csaba Szepesvári, Dale Schuurmans

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an O(1/t) rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate O(e^{−c·t}), for some constant c > 0, toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new Ω(1/t) lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree.
Researcher Affiliation | Collaboration | Jincheng Mei 1,2,*, Chenjun Xiao 1, Csaba Szepesvári 3,1, Dale Schuurmans 2,1. * Work done as an intern at Google Brain. 1 University of Alberta; 2 Google Research, Brain Team; 3 DeepMind.
Pseudocode | Yes | Algorithm 1 (Policy Gradient Method). Input: learning rate η > 0. Initialize logits θ_1(s, a) for all (s, a). For t = 1 to T: θ_{t+1} ← θ_t + η · ∂V^{π_{θ_t}}(µ)/∂θ_t. End for. (A runnable sketch of this update in the bandit special case is given after the table.)
Open Source Code | No | The paper does not provide any statements about code release or links to source code.
Open Datasets | No | The paper is theoretical and focuses on convergence rates in tabular MDPs and bandit problems. It does not mention using or providing access to any real-world or benchmark datasets for training.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, so no training/validation/test splits are discussed.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware (e.g., CPU or GPU models) used for computations or simulations.
Software Dependencies | No | The paper does not mention any specific software dependencies, libraries, or version numbers.
Experiment Setup | Yes | Algorithm 1 specifies 'Input: Learning rate η > 0', and the paper fixes concrete values: η = 2/5 in Theorems 2 and 3, η = (1−γ)³/8 in Theorem 4, and η = (1−γ)³/(8 + τ(4 + 8 log A)) in Theorem 6. These are the hyperparameter (step-size) settings for the algorithms being analyzed; the first sketch below reuses η = 2/5.
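Below is a minimal, hypothetical Python sketch of Algorithm 1 in the single-state (bandit) special case that the paper also analyzes. The closed-form softmax gradient ∂⟨π_θ, r⟩/∂θ_a = π_θ(a)·(r(a) − ⟨π_θ, r⟩) stands in for ∂V^{π_{θ_t}}(µ)/∂θ_t; the reward vector and iteration count are illustrative choices, while η = 2/5 matches the value used in Theorem 2.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative 3-armed bandit with a unique optimal action (arm 0)
r = np.array([1.0, 0.8, 0.1])
theta = np.zeros(3)                  # uniform initial policy
eta = 2.0 / 5.0                      # step size from Theorem 2

for t in range(5000):
    pi = softmax(theta)
    # Exact softmax policy gradient in the bandit case:
    # d<pi, r>/dtheta_a = pi(a) * (r(a) - <pi, r>)
    grad = pi * (r - pi @ r)
    theta += eta * grad              # the Algorithm 1 ascent update

pi = softmax(theta)
print("final policy:", pi)                    # concentrates on arm 0
print("sub-optimality:", r.max() - pi @ r)    # decays like O(1/t) per Theorem 2
```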
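A companion sketch of the entropy-regularized update in the same bandit setting, again hypothetical: differentiating ⟨π_θ, r⟩ + τ·H(π_θ) through the softmax gives the gradient π_θ ⊙ (c − ⟨π_θ, c⟩) with c = r − τ log π_θ, whose unique full-support fixed point is the softmax optimal policy softmax(r/τ). The temperature, step size, and iteration count here are illustrative and are not values taken from the paper.

```python
import numpy as np

def softmax(v):
    z = v - v.max()                  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

r = np.array([1.0, 0.8, 0.1])        # same illustrative 3-armed bandit
tau = 0.2                            # illustrative temperature
eta = 0.4                            # illustrative step size
theta = np.zeros(3)

pi_star = softmax(r / tau)           # softmax optimal policy for this objective
for t in range(200):
    pi = softmax(theta)
    c = r - tau * np.log(pi)         # regularized "reward": r - tau * log pi
    theta += eta * pi * (c - pi @ c) # exact gradient of <pi, r> + tau * H(pi)

pi = softmax(theta)
# The distance to pi_star shrinks at a linear (geometric) rate in t,
# in contrast to the O(1/t) decay of the unregularized sketch above.
print("||pi - pi_star||_1 =", np.abs(pi - pi_star).sum())
```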