The Role of Baselines in Policy Gradient Optimization
Authors: Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an O(1/t) rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings. (A minimal sketch of the baselined NPG update appears after this table.) |
| Researcher Affiliation | Collaboration | 1 Google Research, Brain Team; 2 Mila, McGill University; 3 Mila, University of Montreal; 4 DeepMind; 5 Amii, University of Alberta |
| Pseudocode | Yes | Algorithm 1: NPG, on-policy stochastic natural gradient |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | No | The paper does not specify the use of a publicly available dataset with concrete access information for training. It defines a one-state Markov Decision Process (MDP) for simulations and generalizes to finite MDPs, rather than using a pre-existing public dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or sample counts for training, validation, and test sets). It describes experimental settings for an MDP simulation, not data splits for a traditional dataset. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. While it implies the use of standard machine learning libraries, it does not list them (e.g., 'Python 3.x, PyTorch 1.x'). |
| Experiment Setup | Yes | We first consider a one-state MDP with K = 20 actions and true mean reward vector r ∈ (0, 1)^K, where the optimal action is a* = 1 with true mean reward r(1) = 0.97 and the best sub-optimal action's true mean reward is r(2) = 0.95. The sampled reward is observed with a large noise, e.g., x = −2.03 and x = 3.97, each with probability 0.5 for the optimal action, such that r(1) = 0.5 · (−2.03) + 0.5 · 3.97. ... To verify asymptotic convergence to a globally optimal policy in Lemma 2, we consider the iteration behaviors of Update 2 under an adversarial initialization, where πθ1(2) ≈ 0.88, i.e., a sub-optimal action starts with a dominating probability. ... We run Update 2 with a uniform initialization, i.e., πθ1(a) = 1/K for all a ∈ [K], and calculate the averaged sub-optimality gap (π* − πθt)⊤ r across 20 independent runs, using deterministic reward settings where r̂t is from Definition 2. (An illustrative simulation sketch follows this table.) |
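
To make the update discussed in the table concrete, here is a minimal sketch of one on-policy stochastic softmax NPG step on a one-state MDP (bandit), with and without a state value baseline. The importance-weighted reward estimator, the use of the true mean-reward vector in the baseline, and the function names are assumptions for illustration; they follow a standard construction and are not necessarily identical to the paper's Algorithm 1 or Update 2.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over the logits theta."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def stochastic_npg_step(theta, r_mean, eta, rng, sample_reward=None, use_baseline=True):
    """One on-policy stochastic softmax NPG step on a one-state MDP (bandit).

    Assumed construction (not necessarily the paper's exact update):
    sample an action from the current policy, importance-weight its
    (optionally baselined) reward, and shift the logits by eta times
    that estimate. `r_mean` is the true mean-reward vector, used for the
    state value baseline pi^T r and, if `sample_reward` is None, as a
    noiseless reward model.
    """
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)                 # on-policy action
    R = r_mean[a] if sample_reward is None else sample_reward(a, rng)
    b = pi @ r_mean if use_baseline else 0.0         # state value baseline
    r_hat = np.zeros_like(theta, dtype=float)
    r_hat[a] = (R - b) / pi[a]                       # importance-weighted estimate
    return theta + eta * r_hat                       # NPG step on the logits
```

Informally, this makes the "aggressiveness" point visible: without the baseline, any positive sampled reward pushes the sampled action's logit up by R / π(a), whereas subtracting π⊤r turns the step negative whenever the sampled action is worse than the current policy's average, even though both estimators retain the 1/π(a) factor and hence an unbounded variance.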
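
The experiment-setup row can likewise be sketched as a small simulation. Only the quantities stated in that row come from the paper (K = 20, r(1) = 0.97, r(2) = 0.95, the −2.03 / 3.97 reward noise on the optimal action, uniform initialization, 20 independent runs); the remaining sub-optimal mean rewards, the learning rate, the horizon, and the decision to keep the optimal action's reward noisy are assumptions, so this is an illustration rather than a faithful reproduction.

```python
import numpy as np

K = 20                                  # number of actions (from the paper)
r = np.linspace(0.05, 0.95, K - 1)      # assumed sub-optimal mean rewards
r = np.concatenate(([0.97], r[::-1]))   # r(1) = 0.97, r(2) = 0.95 as stated
eta = 0.1                               # learning rate (assumed)
T = 10_000                              # number of updates (assumed)
runs = 20                               # independent runs (from the paper)

def softmax(theta):
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def sample_reward(a, rng):
    """Optimal action: -2.03 or 3.97 with probability 0.5 each (mean 0.97);
    other actions return their true mean (a simplification)."""
    return rng.choice([-2.03, 3.97]) if a == 0 else r[a]

gaps = np.zeros(T)
for seed in range(runs):
    rng = np.random.default_rng(seed)
    theta = np.zeros(K)                 # uniform initial policy
    for t in range(T):
        pi = softmax(theta)
        gaps[t] += r.max() - pi @ r     # sub-optimality gap (pi* - pi_t)^T r
        a = rng.choice(K, p=pi)         # on-policy sample
        R = sample_reward(a, rng)
        b = pi @ r                      # state value baseline
        r_hat = np.zeros(K)
        r_hat[a] = (R - b) / pi[a]      # importance-weighted advantage estimate
        theta += eta * r_hat            # stochastic NPG update

gaps /= runs
print(f"averaged sub-optimality gap after {T} steps: {gaps[-1]:.4f}")
```

The adversarial-initialization experiment mentioned in the same row would instead start from logits chosen so that a sub-optimal action dominates (πθ1(2) ≈ 0.88) rather than from the uniform initialization used here.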