Improved Regret Bound and Experience Replay in Regularized Policy Iteration

Authors: Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, Csaba Szepesvari

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the POLITEX algorithm (a version of regularized policy iteration) can be sharpened from O(T^{3/4}) to O(√T) under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability O(√T) regret bound for a computationally efficient algorithm in this setting. The exact implementation of POLITEX with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
Researcher Affiliation | Collaboration | 1DeepMind, 2University of Alberta. Correspondence to: Nevena Lazic <nevena@google.com>.
Pseudocode | Yes | Algorithm 1: Schema for policy iteration with replay (an illustrative sketch of such a schema appears after this table).
Open Source Code | No | The paper does not provide a direct link to a code repository or explicitly state that the source code for their methodology is being released.
Open Datasets | Yes | We use two control environments with the simulators described in Tassa et al. (2018). Both environments are episodic with episode length 1000.
Dataset Splits | No | The paper conducts experiments in reinforcement learning environments but does not describe traditional train/validation/test dataset splits, which are not typical in this domain, where data is collected interactively.
Hardware Specification | Yes | Every run is conducted on a single P100 GPU.
Software Dependencies | No | The paper mentions using neural networks and general tools but does not specify software dependencies with version numbers (e.g., PyTorch version, TensorFlow version, Python version).
Experiment Setup | Yes | We approximate Q-functions using neural networks with one hidden layer and ReLU activation: the width of the hidden layer is 50 for Cart-pole and 250 for Ball-in-cup... For Cart-pole, we choose phase length τ = 10^4 and for Ball-in-cup, we choose τ = 2 × 10^4. For our algorithm and the variants of POLITEX, we choose parameter η in {5, 10, 20, 40, 80, 160} and report the result of each algorithm using the value of η that produces the best average reward. We treat the length of Monte Carlo estimate b as a hyperparameter, so we choose it in {100, 300} and report the best performance.
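
The Pseudocode and Experiment Setup rows describe Algorithm 1 (policy iteration with replay) and the Q-network architecture only in prose. The following is a minimal Python/PyTorch sketch of how such a scheme could look, under several assumptions not stated in the paper: a gym-style environment with discrete actions, a one-hidden-layer ReLU Q-network as in the Experiment Setup row, a POLITEX-style softmax policy with parameter η, and squared-loss regression onto truncated Monte Carlo return targets of length b. All names (QNetwork, run_phase, fit_q_on_replay, policy_iteration_with_replay) are illustrative rather than the authors' code, and details such as the average-reward correction and episode-boundary handling are omitted.

# Minimal sketch of "policy iteration with replay" in the spirit of POLITEX.
# Assumptions (not from the paper's code): gym-style env with discrete actions,
# PyTorch, squared-loss regression onto truncated Monte Carlo returns.

import numpy as np
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    # One-hidden-layer ReLU network, as described in the Experiment Setup row
    # (hidden width 50 for Cart-pole, 250 for Ball-in-cup).
    def __init__(self, obs_dim, num_actions, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def softmax_action(q_net, obs, eta):
    # POLITEX-style softmax policy: pi(a | s) proportional to exp(eta * Q(s, a)).
    with torch.no_grad():
        logits = eta * q_net(torch.as_tensor(obs, dtype=torch.float32))
    probs = torch.softmax(logits, dim=-1).numpy()
    return int(np.random.choice(len(probs), p=probs))


def run_phase(env, q_net, eta, tau, b):
    # Act for tau + b steps with the current softmax policy and label each of the
    # first tau visited (state, action) pairs with a length-b Monte Carlo return.
    # For brevity this sketch ignores episode boundaries when summing rewards.
    states, actions, rewards = [], [], []
    obs = env.reset()
    for _ in range(tau + b):
        a = softmax_action(q_net, obs, eta)
        next_obs, r, done, _ = env.step(a)
        states.append(obs)
        actions.append(a)
        rewards.append(r)
        obs = env.reset() if done else next_obs
    returns = [sum(rewards[t:t + b]) for t in range(tau)]
    return list(zip(states[:tau], actions[:tau], returns))


def fit_q_on_replay(q_net, replay, epochs=10, lr=1e-3, batch_size=256):
    # Regress the single Q-network onto every (state, action, return) tuple in
    # the replay buffer, so it approximates the average Q-function of past policies.
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(epochs):
        np.random.shuffle(replay)
        for i in range(0, len(replay), batch_size):
            batch = replay[i:i + batch_size]
            s = torch.as_tensor(np.array([x[0] for x in batch]), dtype=torch.float32)
            a = torch.as_tensor([x[1] for x in batch], dtype=torch.int64)
            g = torch.as_tensor([x[2] for x in batch], dtype=torch.float32)
            pred = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, g)
            opt.zero_grad()
            loss.backward()
            opt.step()


def policy_iteration_with_replay(env, obs_dim, num_actions,
                                 num_phases=50, tau=10_000, eta=10.0, b=100):
    # Schema: one shared replay buffer; each phase appends data from the current
    # policy and refits a single Q-network on ALL past data.
    q_net = QNetwork(obs_dim, num_actions)
    replay = []
    for _ in range(num_phases):
        replay += run_phase(env, q_net, eta, tau, b)
        fit_q_on_replay(q_net, replay)
    return q_net

The design point the paper argues for is visible in the last function: a single Q-network is refit on the entire replay buffer after every phase, so that it approximates the average of the action-value functions of past policies rather than only the most recent one.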