An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Authors: Yanli Liu, Kaiqing Zhang, Tamer Basar, Wotao Yin

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we compare the numerical performances of stochastic PG, NPG, SRVR-PG, and SRVR-NPG. Specifically, we test on benchmark reinforcement learning environments Cartpole and Mountain Car.
Researcher Affiliation | Academia | Department of Mathematics, University of California, Los Angeles; Department of ECE and CSL, University of Illinois at Urbana-Champaign
Pseudocode | Yes | Algorithm 1: Stochastic Recursive Variance Reduced Natural Policy Gradient (SRVR-NPG). (A schematic sketch of the update this name refers to appears after this table.)
Open Source Code | Yes | Our implementation is based on the implementations of SVRPG (https://github.com/Dam930/rllab) and SRVR-PG (https://github.com/xgfelicia/SRVRPG), and can be found in the supplementary material.
Open Datasets | No | The paper refers to benchmark reinforcement learning environments (Cartpole and Mountain Car), which are simulations rather than fixed public datasets with associated access information such as URLs or formal citations for public availability.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions that the implementation is based on "rllab" and "SRVRPG" but does not provide specific version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | For both tasks, we apply a Gaussian policy of the form π_θ(a | s) = (1/√(2π)) exp(−(µ_θ(s) − a)²/2), where the mean µ_θ(s) is modeled by a neural network with Tanh as the activation function. For the Cartpole problem, we apply a neural network of size 32 × 1 and a horizon of H = 100. In addition, each training algorithm uses 5000 trajectories in total. For the Mountain Car problem, we apply a neural network of size 64 × 1 and take H = 1000. A total of 3000 trajectories are allowed for each algorithm. The numerical performance comparison, as well as the settings of algorithm-specific parameters, can be found in Figures 1 and 2. In App. O, we provide more implementation details. (A sketch of this policy parameterization appears below.)
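The Pseudocode row above names Algorithm 1 (SRVR-NPG). The numpy sketch below illustrates only the two ingredients that name points to: a recursive variance-reduced gradient estimate in the style of SRVR-PG and a Fisher-preconditioned (natural gradient) ascent step. The helper names, toy dimensions, synthetic gradients, identity Fisher estimate, and damping term are illustrative assumptions, not the paper's algorithm or code.

import numpy as np

def srvr_gradient_update(v_prev, grads_new, grads_old, weights):
    """Recursive variance-reduced estimate:
    v_t = v_{t-1} + mean_j [ g(tau_j; theta_t) - w_j * g(tau_j; theta_{t-1}) ],
    where w_j is an importance weight for trajectory tau_j."""
    correction = grads_new - weights[:, None] * grads_old
    return v_prev + correction.mean(axis=0)

def natural_gradient_step(theta, v, fisher, step_size, damping=1e-3):
    """Natural policy gradient ascent step: theta + eta * F^{-1} v,
    with a small damping term added for numerical stability."""
    d = theta.size
    direction = np.linalg.solve(fisher + damping * np.eye(d), v)
    return theta + step_size * direction

# Toy usage with synthetic per-trajectory gradients (parameter dimension 4, mini-batch 8).
rng = np.random.default_rng(0)
d, B = 4, 8
theta = np.zeros(d)
v = rng.normal(size=d)                       # epoch-start gradient estimate
grads_old = rng.normal(size=(B, d))          # per-trajectory gradients at the previous iterate
grads_new = grads_old + 0.1 * rng.normal(size=(B, d))
weights = np.exp(0.05 * rng.normal(size=B))  # stand-in importance weights
fisher = np.eye(d)                           # stand-in Fisher information estimate

v = srvr_gradient_update(v, grads_new, grads_old, weights)
theta = natural_gradient_step(theta, v, fisher, step_size=0.05)
print(theta)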
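The Experiment Setup row describes a unit-variance Gaussian policy whose mean is a Tanh network. The following is a minimal PyTorch sketch of that parameterization, not the authors' rllab-based code: the class name, the use of torch, and the fixed unit variance are assumptions, while the layer widths follow the reported 32 × 1 (Cartpole) and 64 × 1 (Mountain Car) configurations.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Unit-variance Gaussian policy whose mean mu_theta(s) is a Tanh MLP,
    matching the reported setup (hidden width 32 for Cartpole, 64 for Mountain Car)."""

    def __init__(self, obs_dim, hidden=32):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),   # scalar action mean mu_theta(s)
        )

    def forward(self, obs):
        mean = self.mean_net(obs)
        # pi_theta(a | s) = N(mu_theta(s), 1)
        return torch.distributions.Normal(mean, torch.ones_like(mean))

# Example: sample an action and evaluate its log-probability (the quantity used by
# REINFORCE-style gradient estimators) for a 4-dimensional observation such as Cartpole's state.
policy = GaussianPolicy(obs_dim=4, hidden=32)
dist = policy(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)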