Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity

Authors: Shaocong Ma, Ziyi Chen, Yi Zhou, Shaofeng Zou

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity in the order of O(ϵ^{-2}). We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings.
Researcher Affiliation | Academia | Shaocong Ma, Ziyi Chen & Yi Zhou, Department of ECE, University of Utah, Salt Lake City, UT 84112 ({s.ma,u1276972,yi.zhou}@utah.edu); Shaofeng Zou, Department of EE, University at Buffalo, Buffalo, NY 14260 (szou3@buffalo.edu)
Pseudocode | Yes | Algorithm 1: Variance-Reduced Greedy-GQ (a minimal sketch of the update structure appears after this table)
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | No | The paper uses the Garnet problem and the Frozen Lake game, which are well-known reinforcement learning environments. However, it describes generating features and setting up parameters for these problems rather than providing a link or specific citation to a publicly available dataset. For example, it states: "We set n_S = 5, n_A = 3, b = 2, d = 4 and generate the features Φ ∈ R^{n_S × d} via the uniform distribution on [0, 1]" for Garnet, and "We generate a Gaussian feature matrix with dimension 8" for Frozen Lake, indicating the data was generated specifically for their experiments, without public access to the generated data (see the feature-generation sketch after this table).
Dataset Splits | No | The paper does not explicitly provide details about a validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers (e.g., specific libraries and their versions).
Experiment Setup | Yes | For the Garnet problem, we set n_S = 5, n_A = 3, b = 2, d = 4... The discount factor is set to γ = 0.95. ... We set the default learning rates as η_θ = 0.02 and η_ω = 0.01 for both the VR-Greedy-GQ and Greedy-GQ algorithms. For VR-Greedy-GQ, we set the default batch size as M = 3000. ... We set the learning rates as η_θ = 0.2 and η_ω = 0.1 for both algorithms and set the batch size as M = 3000 for VR-Greedy-GQ. We run 200k iterations for each of the 10 trajectories. (These settings are collected into a config sketch after this table.)
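
The two time-scale structure and SVRG-style correction described in the abstract and Algorithm 1 can be summarized as follows. This is a minimal sketch, not the authors' implementation: the callables sample, grad_theta, and grad_omega are hypothetical stand-ins for the Markovian sampler and the Greedy-GQ update directions (sign conventions folded into the callables), and the default step sizes mirror the Garnet settings quoted above.

    import numpy as np

    def vr_greedy_gq_sketch(sample, grad_theta, grad_omega, theta, omega,
                            eta_theta=0.02, eta_omega=0.01, M=3000, epochs=10):
        # Each epoch: snapshot the iterates, form batch reference gradients over
        # M samples, then run variance-reduced two time-scale inner updates.
        # The precise sampling schedule of Algorithm 1 may differ from this sketch.
        for _ in range(epochs):
            theta_ref, omega_ref = theta.copy(), omega.copy()
            batch = [sample() for _ in range(M)]
            g_theta = np.mean([grad_theta(theta_ref, omega_ref, s) for s in batch], axis=0)
            g_omega = np.mean([grad_omega(theta_ref, omega_ref, s) for s in batch], axis=0)
            for _ in range(M):
                s = sample()  # fresh sample along the (Markovian) trajectory
                # SVRG correction: stochastic gradient at the current iterate,
                # recentered by the snapshot gradient plus the batch average.
                v_theta = grad_theta(theta, omega, s) - grad_theta(theta_ref, omega_ref, s) + g_theta
                v_omega = grad_omega(theta, omega, s) - grad_omega(theta_ref, omega_ref, s) + g_omega
                theta = theta - eta_theta * v_theta
                omega = omega - eta_omega * v_omega
        return theta, omega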
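
The feature matrices quoted in the Open Datasets row can be regenerated from the stated distributions alone. A minimal sketch, assuming numpy and a seed of our choosing (the paper reports neither); the 16-state row count for Frozen Lake is our assumption based on the standard 4x4 map.

    import numpy as np

    rng = np.random.default_rng(0)  # seed is ours; the paper does not report one

    # Garnet: n_S = 5 states, n_A = 3 actions, branching factor b = 2,
    # feature dimension d = 4; features drawn uniformly on [0, 1].
    n_S, n_A, b, d = 5, 3, 2, 4
    Phi_garnet = rng.uniform(0.0, 1.0, size=(n_S, d))

    # Frozen Lake: Gaussian feature matrix with feature dimension 8; one row
    # per state of the standard 4x4 map is our assumption.
    Phi_frozen_lake = rng.normal(size=(16, 8))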
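
For reference, the hyperparameters quoted in the Experiment Setup row, collected in one place. The dict layout and key names are ours, and attributing the second group of settings (including the 200k iterations over 10 trajectories) to the Frozen Lake experiment is our reading of the elided quote.

    # Hyperparameters as quoted above; structure and key names are ours.
    garnet_cfg = {
        "n_S": 5, "n_A": 3, "b": 2, "d": 4,
        "gamma": 0.95,
        "eta_theta": 0.02, "eta_omega": 0.01,  # defaults, both algorithms
        "batch_size_M": 3000,                  # VR-Greedy-GQ only
    }
    second_experiment_cfg = {                  # Frozen Lake, per our reading
        "eta_theta": 0.2, "eta_omega": 0.1,    # both algorithms
        "batch_size_M": 3000,                  # VR-Greedy-GQ only
        "iterations": 200_000, "trajectories": 10,
    }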