Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity
Authors: Shaocong Ma, Ziyi Chen, Yi Zhou, Shaofeng Zou
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity that is in the order of O(ϵ⁻²). We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings. |
| Researcher Affiliation | Academia | Shaocong Ma, Ziyi Chen & Yi Zhou, Department of ECE, University of Utah, Salt Lake City, UT 84112 ({s.ma,u1276972,yi.zhou}@utah.edu); Shaofeng Zou, Department of EE, University at Buffalo, Buffalo, NY 14260 (szou3@buffalo.edu) |
| Pseudocode | Yes | Algorithm 1: Variance-Reduced Greedy-GQ (a hedged sketch of the update scheme appears below the table) |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | No | The paper uses the Garnet problem and the Frozen Lake game, which are well-known reinforcement learning environments. However, it describes generating features and setting up parameters for these problems, rather than providing a link or specific citation to a publicly available dataset used for training. For example, it states: "We set n_S = 5, n_A = 3, b = 2, d = 4 and generate the features Φ ∈ ℝ^{n_S×d} via the uniform distribution on [0, 1]" for Garnet, and "We generate a Gaussian feature matrix with dimension 8" for Frozen Lake, indicating the data was generated specifically for their experiments without providing public access to their generated data. |
| Dataset Splits | No | The paper does not describe a validation split or any other train/validation/test partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify ancillary software dependencies or their versions (e.g., library names with version numbers). |
| Experiment Setup | Yes | For the Garnet problem, we set n_S = 5, n_A = 3, b = 2, d = 4... The discount factor is set to be γ = 0.95. ... We set the default learning rates as η_θ = 0.02 and η_ω = 0.01 for both the VR-Greedy-GQ and Greedy-GQ algorithms. For VR-Greedy-GQ, we set the default batch size as M = 3000. ... We set the learning rates as η_θ = 0.2 and η_ω = 0.1 for both algorithms and set the batch size as M = 3000 for the VR-Greedy-GQ. We run 200k iterations for each of the 10 trajectories. (An environment-setup sketch follows the table.) |
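
The Pseudocode row above refers to the paper's Algorithm 1 (Variance-Reduced Greedy-GQ). Since the paper releases no code, here is a minimal illustrative Python sketch of an SVRG-style variance-reduced two-timescale Greedy-GQ loop under linear function approximation. The samplers `sample_batch`/`sample_one` are hypothetical interfaces, the transition tuples `(phi_s, phi_next, r)` treat the greedy next-action feature as fixed (in the paper it depends on θ), and the paper's projection step is omitted; this is a reading aid, not the authors' implementation.

```python
import numpy as np

def greedy_gq_direction(theta, omega, phi_s, phi_next, r, gamma):
    """One-sample Greedy-GQ update directions under linear features.

    phi_s: feature of the current state-action pair; phi_next: feature of
    the next state under the greedy action (treated as fixed here; in the
    paper it depends on theta). Returns (h_theta, h_omega).
    """
    delta = r + gamma * (phi_next @ theta) - phi_s @ theta  # TD error
    h_theta = delta * phi_s - gamma * (phi_s @ omega) * phi_next
    h_omega = (delta - phi_s @ omega) * phi_s
    return h_theta, h_omega

def vr_greedy_gq(sample_batch, sample_one, d, gamma=0.95,
                 eta_theta=0.02, eta_omega=0.01, M=3000, epochs=10):
    """SVRG-style variance-reduced Greedy-GQ (illustrative sketch only).

    sample_batch(M): returns M transition tuples (phi_s, phi_next, r);
    sample_one(): returns one transition from the Markovian trajectory.
    The paper's projection of theta onto a bounded set is omitted.
    """
    theta, omega = np.zeros(d), np.zeros(d)
    for _ in range(epochs):
        # Snapshot point and batch (reference) directions over M samples.
        theta_ref, omega_ref = theta.copy(), omega.copy()
        refs = [greedy_gq_direction(theta_ref, omega_ref, *tr, gamma)
                for tr in sample_batch(M)]
        H_theta = np.mean([h[0] for h in refs], axis=0)
        H_omega = np.mean([h[1] for h in refs], axis=0)
        for _ in range(M):
            tr = sample_one()
            h_t, h_w = greedy_gq_direction(theta, omega, *tr, gamma)
            h_t0, h_w0 = greedy_gq_direction(theta_ref, omega_ref, *tr, gamma)
            # SVRG correction: current direction - snapshot direction + batch mean.
            theta = theta + eta_theta * (h_t - h_t0 + H_theta)
            omega = omega + eta_omega * (h_w - h_w0 + H_omega)
    return theta, omega
```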
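
The Experiment Setup row reports a Garnet(n_S = 5, n_A = 3, b = 2) environment with d = 4 features drawn uniformly from [0, 1]. Below is a minimal sketch of how such a Garnet MDP is commonly generated; the branching and reward conventions (Dirichlet branch probabilities, U[0,1] rewards) are our assumptions, since the paper does not release its generator.

```python
import numpy as np

def make_garnet(n_s=5, n_a=3, b=2, d=4, seed=0):
    """Generate a Garnet(n_S, n_A, b) MDP plus uniform random features.

    Branching factor b: each (s, a) transitions to b randomly chosen states
    with random probabilities. The U[0,1] reward convention is an assumption.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_s, n_a, n_s))                 # transition kernel P[s, a, s']
    for s in range(n_s):
        for a in range(n_a):
            succ = rng.choice(n_s, size=b, replace=False)  # b reachable states
            P[s, a, succ] = rng.dirichlet(np.ones(b))      # random branch probs
    R = rng.uniform(size=(n_s, n_a))              # assumed U[0,1] rewards
    Phi = rng.uniform(size=(n_s, d))              # features Φ in R^{n_S x d}
    return P, R, Phi

# Example: draw one environment matching the reported sizes.
P, R, Phi = make_garnet(n_s=5, n_a=3, b=2, d=4)
```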