An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Authors: Yanli Liu, Kaiqing Zhang, Tamer Basar, Wotao Yin

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we compare the numerical performances of stochastic PG, NPG, SRVR-PG, and SRVR-NPG. Specifically, we test on benchmark reinforcement learning environments Cartpole and Mountain Car.
Researcher Affiliation | Academia | Department of Mathematics, University of California, Los Angeles; Department of ECE and CSL, University of Illinois at Urbana-Champaign
Pseudocode | Yes | Algorithm 1: Stochastic Recursive Variance Reduced Natural Policy Gradient (SRVR-NPG). (A schematic sketch of the update this name refers to appears after this table.)
Open Source Code | Yes | Our implementation is based on the implementations of SVRPG (https://github.com/Dam930/rllab) and SRVR-PG (https://github.com/xgfelicia/SRVRPG), and can be found in the supplementary material.
Open Datasets | No | The paper refers to benchmark reinforcement learning environments (Cartpole and Mountain Car), which are simulations rather than fixed public datasets with associated access information such as URLs or formal citations for public availability.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions that the implementation is based on "rllab" and "SRVRPG" but does not provide specific version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | For both tasks, we apply a Gaussian policy of the form π_θ(a | s) = (1/√(2π)) exp(−(µ_θ(s) − a)²/2), where the mean µ_θ(s) is modeled by a neural network with Tanh as the activation function. For the Cartpole problem, we apply a neural network of size 32 × 1 and a horizon of H = 100. In addition, each training algorithm uses 5000 trajectories in total. For the Mountain Car problem, we apply a neural network of size 64 × 1 and take H = 1000. A total of 3000 trajectories are allowed for each algorithm. The numerical performance comparison, as well as the settings of algorithm-specific parameters, can be found in Figures 1 and 2. In App. O, we provide more implementation details. (A sketch of this policy parameterization appears below.)
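The Pseudocode row above names Algorithm 1 (SRVR-NPG). The numpy sketch below illustrates only the two ingredients that name points to: a recursive variance-reduced gradient estimate in the style of SRVR-PG and a Fisher-preconditioned (natural gradient) ascent step. The helper names, toy dimensions, synthetic gradients, identity Fisher estimate, and damping term are illustrative assumptions, not the paper's algorithm or code.

import numpy as np

def srvr_gradient_update(v_prev, grads_new, grads_old, weights):
    """Recursive variance-reduced estimate:
    v_t = v_{t-1} + mean_j [ g(tau_j; theta_t) - w_j * g(tau_j; theta_{t-1}) ],
    where w_j is an importance weight for trajectory tau_j."""
    correction = grads_new - weights[:, None] * grads_old
    return v_prev + correction.mean(axis=0)

def natural_gradient_step(theta, v, fisher, step_size, damping=1e-3):
    """Natural policy gradient ascent step: theta + eta * F^{-1} v,
    with a small damping term added for numerical stability."""
    d = theta.size
    direction = np.linalg.solve(fisher + damping * np.eye(d), v)
    return theta + step_size * direction

# Toy usage with synthetic per-trajectory gradients (parameter dimension 4, mini-batch 8).
rng = np.random.default_rng(0)
d, B = 4, 8
theta = np.zeros(d)
v = rng.normal(size=d)                       # epoch-start gradient estimate
grads_old = rng.normal(size=(B, d))          # per-trajectory gradients at the previous iterate
grads_new = grads_old + 0.1 * rng.normal(size=(B, d))
weights = np.exp(0.05 * rng.normal(size=B))  # stand-in importance weights
fisher = np.eye(d)                           # stand-in Fisher information estimate

v = srvr_gradient_update(v, grads_new, grads_old, weights)
theta = natural_gradient_step(theta, v, fisher, step_size=0.05)
print(theta)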
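The Experiment Setup row describes a unit-variance Gaussian policy whose mean is a Tanh network. The following is a minimal PyTorch sketch of that parameterization, not the authors' rllab-based code: the class name, the use of torch, and the fixed unit variance are assumptions, while the layer widths follow the reported 32 × 1 (Cartpole) and 64 × 1 (Mountain Car) configurations.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Unit-variance Gaussian policy whose mean mu_theta(s) is a Tanh MLP,
    matching the reported setup (hidden width 32 for Cartpole, 64 for Mountain Car)."""

    def __init__(self, obs_dim, hidden=32):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),   # scalar action mean mu_theta(s)
        )

    def forward(self, obs):
        mean = self.mean_net(obs)
        # pi_theta(a | s) = N(mu_theta(s), 1)
        return torch.distributions.Normal(mean, torch.ones_like(mean))

# Example: sample an action and evaluate its log-probability (the quantity used by
# REINFORCE-style gradient estimators) for a 4-dimensional observation such as Cartpole's state.
policy = GaussianPolicy(obs_dim=4, hidden=32)
dist = policy(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)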