Improved Regret Bound and Experience Replay in Regularized Policy Iteration

Authors: Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, Csaba Szepesvari

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the POLITEX algorithm (a version of regularized policy iteration) can be sharpened from O(T^{3/4}) to O(√T) under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability O(√T) regret bound for a computationally efficient algorithm in this setting. The exact implementation of POLITEX with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
Researcher Affiliation | Collaboration | 1DeepMind, 2University of Alberta. Correspondence to: Nevena Lazic <nevena@google.com>.
Pseudocode | Yes | Algorithm 1: Schema for policy iteration with replay (an illustrative sketch of such a schema appears after this table).
Open Source Code | No | The paper does not provide a direct link to a code repository or explicitly state that the source code for their methodology is being released.
Open Datasets | Yes | We use two control environments with the simulators described in Tassa et al. (2018). Both environments are episodic with episode length 1000.
Dataset Splits | No | The paper conducts experiments in reinforcement learning environments but does not describe traditional train/validation/test dataset splits, which are not typical in this domain, where data is collected interactively.
Hardware Specification | Yes | Every run is conducted on a single P100 GPU.
Software Dependencies | No | The paper mentions using neural networks and general tools but does not specify software dependencies with version numbers (e.g., PyTorch version, TensorFlow version, Python version).
Experiment Setup | Yes | We approximate Q-functions using neural networks with one hidden layer and ReLU activation: the width of the hidden layer is 50 for Cart-pole and 250 for Ball-in-cup... For Cart-pole, we choose phase length τ = 10^4 and for Ball-in-cup, we choose τ = 2 × 10^4. For our algorithm and the variants of POLITEX, we choose parameter η in {5, 10, 20, 40, 80, 160} and report the result of each algorithm using the value of η that produces the best average reward. We treat the length of Monte Carlo estimate b as a hyperparameter, so we choose it in {100, 300} and report the best performance.
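
The Pseudocode and Experiment Setup rows describe Algorithm 1 (policy iteration with replay) and the Q-network architecture only in prose. The following is a minimal Python/PyTorch sketch of how such a scheme could look, under several assumptions not stated in the paper: a gym-style environment with discrete actions, a one-hidden-layer ReLU Q-network as in the Experiment Setup row, a POLITEX-style softmax policy with parameter η, and squared-loss regression onto truncated Monte Carlo return targets of length b. All names (QNetwork, run_phase, fit_q_on_replay, policy_iteration_with_replay) are illustrative rather than the authors' code, and details such as the average-reward correction and episode-boundary handling are omitted.

# Minimal sketch of "policy iteration with replay" in the spirit of POLITEX.
# Assumptions (not from the paper's code): gym-style env with discrete actions,
# PyTorch, squared-loss regression onto truncated Monte Carlo returns.

import numpy as np
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    # One-hidden-layer ReLU network, as described in the Experiment Setup row
    # (hidden width 50 for Cart-pole, 250 for Ball-in-cup).
    def __init__(self, obs_dim, num_actions, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def softmax_action(q_net, obs, eta):
    # POLITEX-style softmax policy: pi(a | s) proportional to exp(eta * Q(s, a)).
    with torch.no_grad():
        logits = eta * q_net(torch.as_tensor(obs, dtype=torch.float32))
    probs = torch.softmax(logits, dim=-1).numpy()
    return int(np.random.choice(len(probs), p=probs))


def run_phase(env, q_net, eta, tau, b):
    # Act for tau + b steps with the current softmax policy and label each of the
    # first tau visited (state, action) pairs with a length-b Monte Carlo return.
    # For brevity this sketch ignores episode boundaries when summing rewards.
    states, actions, rewards = [], [], []
    obs = env.reset()
    for _ in range(tau + b):
        a = softmax_action(q_net, obs, eta)
        next_obs, r, done, _ = env.step(a)
        states.append(obs)
        actions.append(a)
        rewards.append(r)
        obs = env.reset() if done else next_obs
    returns = [sum(rewards[t:t + b]) for t in range(tau)]
    return list(zip(states[:tau], actions[:tau], returns))


def fit_q_on_replay(q_net, replay, epochs=10, lr=1e-3, batch_size=256):
    # Regress the single Q-network onto every (state, action, return) tuple in
    # the replay buffer, so it approximates the average Q-function of past policies.
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(epochs):
        np.random.shuffle(replay)
        for i in range(0, len(replay), batch_size):
            batch = replay[i:i + batch_size]
            s = torch.as_tensor(np.array([x[0] for x in batch]), dtype=torch.float32)
            a = torch.as_tensor([x[1] for x in batch], dtype=torch.int64)
            g = torch.as_tensor([x[2] for x in batch], dtype=torch.float32)
            pred = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, g)
            opt.zero_grad()
            loss.backward()
            opt.step()


def policy_iteration_with_replay(env, obs_dim, num_actions,
                                 num_phases=50, tau=10_000, eta=10.0, b=100):
    # Schema: one shared replay buffer; each phase appends data from the current
    # policy and refits a single Q-network on ALL past data.
    q_net = QNetwork(obs_dim, num_actions)
    replay = []
    for _ in range(num_phases):
        replay += run_phase(env, q_net, eta, tau, b)
        fit_q_on_replay(q_net, replay)
    return q_net

The design point the paper argues for is visible in the last function: a single Q-network is refit on the entire replay buffer after every phase, so that it approximates the average of the action-value functions of past policies rather than only the most recent one.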