The Role of Baselines in Policy Gradient Optimization
Authors: Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an O(1/t) rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings. (A minimal sketch of the baselined NPG update appears after this table.) |
| Researcher Affiliation | Collaboration | 1 Google Research, Brain Team; 2 Mila, McGill University; 3 Mila, University of Montreal; 4 DeepMind; 5 Amii, University of Alberta |
| Pseudocode | Yes | Algorithm 1: NPG, on-policy stochastic natural gradient |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | No | The paper does not specify the use of a publicly available dataset with concrete access information for training. It defines a one-state Markov Decision Process (MDP) for simulations and generalizes to finite MDPs, rather than using a pre-existing public dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or sample counts for training, validation, and test sets). It describes experimental settings for an MDP simulation, not data splits for a traditional dataset. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. While it implies the use of standard machine learning libraries, it does not list them (e.g., 'Python 3.x, PyTorch 1.x'). |
| Experiment Setup | Yes | We first consider a one-state MDP with K = 20 actions and true mean reward vector r ∈ (0, 1)^K, where the optimal action is a* = 1 with true mean reward r(1) = 0.97 and the best sub-optimal action's true mean reward is r(2) = 0.95. The sampled reward is observed with a large noise, e.g., x = −2.03 and x = 3.97, each with probability 0.5 for the optimal action, such that r(1) = 0.5 · (−2.03) + 0.5 · 3.97. ... To verify asymptotic convergence to a globally optimal policy in Lemma 2, we consider the iteration behaviors of Update 2 under an adversarial initialization, where πθ1(2) ≈ 0.88, i.e., a sub-optimal action starts with a dominating probability. ... We run Update 2 with a uniform initialization, i.e., πθ1(a) = 1/K for all a ∈ [K], and calculate the averaged sub-optimality gap (π* − πθt)⊤ r across 20 independent runs, using deterministic reward settings where r̂t is from Definition 2. (An illustrative simulation sketch follows this table.) |
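
To make the update discussed in the table concrete, here is a minimal sketch of one on-policy stochastic softmax NPG step on a one-state MDP (bandit), with and without a state value baseline. The importance-weighted reward estimator, the use of the true mean-reward vector in the baseline, and the function names are assumptions for illustration; they follow a standard construction and are not necessarily identical to the paper's Algorithm 1 or Update 2.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over the logits theta."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def stochastic_npg_step(theta, r_mean, eta, rng, sample_reward=None, use_baseline=True):
    """One on-policy stochastic softmax NPG step on a one-state MDP (bandit).

    Assumed construction (not necessarily the paper's exact update):
    sample an action from the current policy, importance-weight its
    (optionally baselined) reward, and shift the logits by eta times
    that estimate. `r_mean` is the true mean-reward vector, used for the
    state value baseline pi^T r and, if `sample_reward` is None, as a
    noiseless reward model.
    """
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)                 # on-policy action
    R = r_mean[a] if sample_reward is None else sample_reward(a, rng)
    b = pi @ r_mean if use_baseline else 0.0         # state value baseline
    r_hat = np.zeros_like(theta, dtype=float)
    r_hat[a] = (R - b) / pi[a]                       # importance-weighted estimate
    return theta + eta * r_hat                       # NPG step on the logits
```

Informally, this makes the "aggressiveness" point visible: without the baseline, any positive sampled reward pushes the sampled action's logit up by R / π(a), whereas subtracting π⊤r turns the step negative whenever the sampled action is worse than the current policy's average, even though both estimators retain the 1/π(a) factor and hence an unbounded variance.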
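
The experiment-setup row can likewise be sketched as a small simulation. Only the quantities stated in that row come from the paper (K = 20, r(1) = 0.97, r(2) = 0.95, the −2.03 / 3.97 reward noise on the optimal action, uniform initialization, 20 independent runs); the remaining sub-optimal mean rewards, the learning rate, the horizon, and the decision to keep the optimal action's reward noisy are assumptions, so this is an illustration rather than a faithful reproduction.

```python
import numpy as np

K = 20                                  # number of actions (from the paper)
r = np.linspace(0.05, 0.95, K - 1)      # assumed sub-optimal mean rewards
r = np.concatenate(([0.97], r[::-1]))   # r(1) = 0.97, r(2) = 0.95 as stated
eta = 0.1                               # learning rate (assumed)
T = 10_000                              # number of updates (assumed)
runs = 20                               # independent runs (from the paper)

def softmax(theta):
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def sample_reward(a, rng):
    """Optimal action: -2.03 or 3.97 with probability 0.5 each (mean 0.97);
    other actions return their true mean (a simplification)."""
    return rng.choice([-2.03, 3.97]) if a == 0 else r[a]

gaps = np.zeros(T)
for seed in range(runs):
    rng = np.random.default_rng(seed)
    theta = np.zeros(K)                 # uniform initial policy
    for t in range(T):
        pi = softmax(theta)
        gaps[t] += r.max() - pi @ r     # sub-optimality gap (pi* - pi_t)^T r
        a = rng.choice(K, p=pi)         # on-policy sample
        R = sample_reward(a, rng)
        b = pi @ r                      # state value baseline
        r_hat = np.zeros(K)
        r_hat[a] = (R - b) / pi[a]      # importance-weighted advantage estimate
        theta += eta * r_hat            # stochastic NPG update

gaps /= runs
print(f"averaged sub-optimality gap after {T} steps: {gaps[-1]:.4f}")
```

The adversarial-initialization experiment mentioned in the same row would instead start from logits chosen so that a sub-optimal action dominates (πθ1(2) ≈ 0.88) rather than from the uniform initialization used here.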