On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
Authors: Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Numerical Experiments): In this experiment, we aim to evaluate the performance of the TSIVR-PG algorithm for maximizing the cumulative sum of reward. As the benchmarks, we also implement the SVRPG [49], the SRVRPG [48], the HSPGA [33], and the REINFORCE [47] algorithms. Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8]. |
| Researcher Affiliation | Academia | Junyu Zhang, Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore 119077, junyuz@nus.edu.sg; Chengzhuo Ni, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, chengzhuo.ni@princeton.edu; Zheng Yu, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, zhengy@princeton.edu; Csaba Szepesvari, Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8, szepesva@ualberta.ca; Mengdi Wang, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, mengdiw@princeton.edu |
| Pseudocode | Yes | Algorithm 1: The TSIVR-PG Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8], which is a well-known toolkit for developing and comparing reinforcement learning algorithms. (A minimal environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper describes using standard RL environments but does not provide specific details on train/validation/test dataset splits, percentages, or methodologies for partitioning data to reproduce the experiment's data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Open AI gym [8]' but does not provide specific version numbers for this or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | For all the algorithms, their batch sizes are chosen according to their theory. In detail, let ϵ be any target accuracy. For both TSIVR-PG and SRVR-PG, we set N = Θ(ϵ^{-2}), B = m = Θ(ϵ^{-1}). For SVRPG, we set N = Θ(ϵ^{-2}), B = Θ(ϵ^{-4/3}) and m = Θ(ϵ^{-2/3}). For HSPGA, we set B = Θ(ϵ^{-1}); the other parameters are calculated according to the formulas in [33] given B. For REINFORCE, we set the batch size to N = Θ(ϵ^{-2}). The parameter ε and the stepsize/learning rate are tuned for each individual algorithm using a grid search. For both environments, we use a neural network with two hidden layers of width 64 to model the policy. We choose σ = 0.125 in our experiment. (A configuration sketch follows the table.) |
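The Open Datasets row cites the OpenAI Gym environments used in the paper. As a quick aid for reproduction, below is a minimal sketch of instantiating those environments and running a random-policy rollout as a sanity check. The environment IDs (`FrozenLake-v1`, `Acrobot-v1`, `CartPole-v1`) and the pre-0.26 Gym API (`env.seed`, 4-tuple `step`) are assumptions, since the paper does not state a Gym version.

```python
# Minimal sketch (not from the paper) of loading the three benchmark environments
# via OpenAI Gym and collecting one random-policy rollout per environment.
import gym

ENV_IDS = ["FrozenLake-v1", "Acrobot-v1", "CartPole-v1"]  # assumed registry names

def random_rollout(env_id: str, horizon: int = 200, seed: int = 0) -> float:
    """Return the undiscounted return of one random-policy episode."""
    env = gym.make(env_id)
    env.seed(seed)                 # classic Gym API; newer versions use reset(seed=...)
    env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)   # 4-tuple in pre-0.26 Gym
        total_reward += reward
        if done:
            break
    env.close()
    return total_reward

if __name__ == "__main__":
    for env_id in ENV_IDS:
        print(f"{env_id}: return = {random_rollout(env_id):.1f}")
```

The random policy is only a placeholder standing in for the learned TSIVR-PG policy; it verifies the environments load and step correctly before any training code is attached.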
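The Experiment Setup row specifies theory-driven batch sizes and a two-hidden-layer, width-64 policy network. The sketch below translates that description into code; the constant factors hidden in the Θ(·) scalings, the tanh activations, and the way σ = 0.125 enters the policy are assumptions not fixed by the quoted text.

```python
# Hedged sketch of the quoted experiment configuration: batch sizes as functions of
# the target accuracy eps, and a two-hidden-layer MLP of width 64 for the policy.
import torch.nn as nn

def batch_sizes(eps: float) -> dict:
    """Theoretical batch-size scalings quoted in the paper (constants set to 1 here)."""
    return {
        "TSIVR-PG / SRVR-PG": {"N": eps ** -2, "B": eps ** -1, "m": eps ** -1},
        "SVRPG":              {"N": eps ** -2, "B": eps ** (-4 / 3), "m": eps ** (-2 / 3)},
        "HSPGA":              {"B": eps ** -1},   # remaining parameters follow [33] given B
        "REINFORCE":          {"N": eps ** -2},
    }

def make_policy_net(obs_dim: int, act_dim: int) -> nn.Module:
    """Two hidden layers of width 64, as stated in the setup; tanh is an assumption."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, act_dim),  # action logits (or a mean, with sigma = 0.125 held fixed)
    )

if __name__ == "__main__":
    print(batch_sizes(eps=0.1))
    print(make_policy_net(obs_dim=4, act_dim=2))
```

The stepsize/learning rate and ε are left out deliberately, since the paper reports tuning them per algorithm by grid search rather than giving fixed values.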