On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

Authors: Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 6 (Numerical Experiments): "In this experiment, we aim to evaluate the performance of the TSIVR-PG algorithm for maximizing the cumulative sum of reward. As the benchmarks, we also implement the SVRPG [49], the SRVRPG [48], the HSPGA [33], and the REINFORCE [47] algorithms. Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8], ..."
Researcher Affiliation | Academia | Junyu Zhang, Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore 119077, junyuz@nus.edu.sg; Chengzhuo Ni, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, chengzhuo.ni@princeton.edu; Zheng Yu, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, zhengy@princeton.edu; Csaba Szepesvari, Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8, szepesva@ualberta.ca; Mengdi Wang, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, mengdiw@princeton.edu
Pseudocode | Yes | Algorithm 1: The TSIVR-PG Algorithm (a generic sketch of this family of variance-reduced updates is given below the table)
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | "Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8], which is a well-known toolkit for developing and comparing reinforcement learning algorithms." (A minimal environment-loading sketch is given below the table.)
Dataset Splits | No | The paper describes using standard RL environments but does not provide train/validation/test splits, split percentages, or a partitioning methodology that would be needed to reproduce the experiment's data partitioning.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using "Open AI gym [8]" but does not provide specific version numbers for this or any other software dependency, which would be necessary for reproducibility.
Experiment Setup | Yes | "For all the algorithms, their batch sizes are chosen according to their theory. In detail, let ϵ be any target accuracy. For both TSIVR-PG and SRVR-PG, we set N = Θ(ϵ^-2), B = m = Θ(ϵ^-1). For SVRPG, we set N = Θ(ϵ^-2), B = Θ(ϵ^-4/3) and m = Θ(ϵ^-2/3). For HSPGA, we set B = Θ(ϵ^-1); the other parameters are calculated according to the formulas in [33] given B. For REINFORCE, we set the batch size to be N = Θ(ϵ^-2). The parameter ϵ and the stepsize/learning rate are tuned for each individual algorithm using a grid search. For both environments, we use a neural network with two hidden layers of width 64 to model the policy. We choose σ = 0.125 in our experiment." (A configuration sketch follows the table.)
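
The three Gym environments named in the Open Datasets row can be loaded with a few lines of code. The sketch below is an illustration, not taken from the paper: the environment IDs ("FrozenLake-v1", "Acrobot-v1", "CartPole-v1") and the pre-gymnasium Gym reset/step API are assumptions consistent with a 2021-era setup.

```python
# Hedged sketch: loading the benchmark environments named in the paper.
# The environment IDs and the (pre-gymnasium) Gym API are assumptions.
import gym

ENV_IDS = ["FrozenLake-v1", "Acrobot-v1", "CartPole-v1"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    done, episode_return = False, 0.0
    # Roll out one random-action episode just to confirm the environment runs.
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
        episode_return += reward
    print(f"{env_id}: random-policy return = {episode_return}")
    env.close()
```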
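
The paper's Algorithm 1 (TSIVR-PG) is not reproduced in this report. As rough orientation only, the sketch below shows the generic epoch-based, variance-reduced policy-gradient pattern that TSIVR-PG shares with SVRPG and SRVR-PG: a large reference batch per epoch plus small-batch recursive corrections with importance weights. The helpers `sample_trajs`, `grad_est`, and `importance_weight` are hypothetical, and the sketch omits the gradient truncation and general-utility machinery that distinguish TSIVR-PG.

```python
# Hedged sketch of a generic variance-reduced policy-gradient loop
# (NOT the paper's Algorithm 1). `sample_trajs`, `grad_est`, and
# `importance_weight` are hypothetical helpers.
import numpy as np

def variance_reduced_pg(theta0, epochs, m, N, B, eta,
                        sample_trajs, grad_est, importance_weight):
    theta = theta0
    for _ in range(epochs):
        # Reference gradient estimate from a large batch of N trajectories.
        ref_batch = sample_trajs(theta, N)
        v = np.mean([grad_est(theta, tau) for tau in ref_batch], axis=0)
        for _ in range(m):
            theta_prev, theta = theta, theta + eta * v
            # Small-batch recursive correction: new-policy gradients minus
            # importance-weighted old-policy gradients on the same trajectories.
            batch = sample_trajs(theta, B)
            v = v + np.mean(
                [grad_est(theta, tau)
                 - importance_weight(tau, theta, theta_prev) * grad_est(theta_prev, tau)
                 for tau in batch], axis=0)
    return theta
```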
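
The Experiment Setup row fixes the batch-size scalings, the policy architecture (two hidden layers of width 64), and σ = 0.125. The sketch below turns that description into a concrete configuration under several assumptions not stated in the quoted text: PyTorch, tanh activations, a categorical (softmax) output head for the discrete Gym action spaces, and unit constants inside the Θ(·) scalings.

```python
# Hedged configuration sketch for the setup quoted above. The activation,
# output parameterization, and Θ(·) constants are assumptions.
import torch
import torch.nn as nn

eps = 0.1                      # hypothetical target accuracy ϵ
N = round(eps ** -2)           # outer batch size, N = Θ(ϵ^-2)
B = m = round(eps ** -1)       # inner batch size / epoch length, Θ(ϵ^-1)
sigma = 0.125                  # σ from the paper; its exact role in the policy
                               # parameterization is not restated in the quote

class PolicyNet(nn.Module):
    """Two hidden layers of width 64, as described in the experiment setup."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)
```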