The Pitfalls of Regularization in Off-Policy TD Learning

Authors: Gaurav Manek, J. Zico Kolter

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce a series of new counterexamples to show that the instability and unbounded error of TD methods is not solved by regularization. We demonstrate that, in the off-policy setting with linear function approximation, TD methods can fail to learn a non-trivial value function under any amount of regularization; we further show that regularization can induce divergence under common conditions; and we show that one of the most promising methods to mitigate this divergence (Emphatic TD algorithms) may also diverge under regularization. We further demonstrate such divergence when using neural networks as function approximators. Thus, we argue that the role of regularization in TD methods needs to be reconsidered, given that it is insufficient to prevent divergence and may itself introduce instability. There needs to be much more care in the practical and theoretical application of regularization to RL methods. ... We finally also illustrate misbehaving regularization in the context of neural network value function approximation, demonstrating the general pitfalls of regularization possible in RL algorithms. ... This is not merely theoretical; we demonstrate this in the neural network case in Section 3.4. ... Example 4. Vacuous models and small-η error also occur in neural network conditions. Details. We train 100 models using simple semi-gradient TD updates under a fixed learning rate.
Researcher Affiliation | Academia | Gaurav Manek, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, gmanek@cs.cmu.edu; J. Zico Kolter, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, zkolter@cs.cmu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] ... Did you include any new assets either in the supplemental material or as a URL? [Yes]
Open Datasets | No | The paper uses constructed Markov processes (e.g., the 'three-state MP in Figure 1a' and the 'nine-state Markov chain (shown in Figure 1b)'). It does not refer to external public datasets with access information (links, DOIs, or specific citations).
Dataset Splits | No | The paper does not explicitly provide train/validation/test dataset splits. While it mentions training models (e.g., 'We train 100 models using simple semi-gradient TD updates'), it does not specify how the data for these models was partitioned.
Hardware Specification | No | The paper states: 'The total compute time to replicate all results in this paper is less than one CPU-hour.' This mentions compute time but does not provide specific hardware details (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | We train 100 models using simple semi-gradient TD updates under a fixed learning rate. We use a 9-state variant of our example... We train a simple two-layer neural network with 3 neurons in the hidden layer. The value function is assigned pseudo-randomly in range [-1, 1]. (See Appendix C for details.)
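The setting quoted in the Research Type row (off-policy TD with linear function approximation and an ℓ2 penalty) can be made concrete with a short sketch. The following is a minimal, hedged illustration of what a regularized off-policy semi-gradient TD(0) loop could look like; the chain, features, importance weights, and hyperparameters are placeholders chosen for illustration, not the paper's counterexamples.

```python
# Minimal illustrative sketch (not the paper's counterexample): off-policy
# semi-gradient TD(0) with linear features and an l2 penalty. The chain,
# features, importance weights, and hyperparameters below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states = 9        # small chain, loosely following the paper's 9-state example
gamma = 0.9         # discount factor (assumed)
alpha = 0.05        # fixed learning rate (assumed)
eta = 1e-2          # l2 regularization strength (assumed)

P = np.eye(n_states, k=1)
P[-1, 0] = 1.0                                    # deterministic cycle (assumed)
rewards = rng.uniform(-1.0, 1.0, size=n_states)   # placeholder rewards
rho = rng.uniform(0.5, 2.0, size=n_states)        # per-state importance weights (assumed)

Phi = rng.normal(size=(n_states, 3))              # linear features, fewer features than states
theta = np.zeros(3)

def v(s):
    return Phi[s] @ theta

for step in range(10_000):
    s = rng.integers(n_states)                    # states drawn from a behavior distribution
    s_next = rng.choice(n_states, p=P[s])
    td_error = rewards[s] + gamma * v(s_next) - v(s)
    # Semi-gradient TD(0) step plus weight decay (the gradient of the l2 penalty).
    theta += alpha * (rho[s] * td_error * Phi[s] - eta * theta)

print("learned values:", Phi @ theta)
print("weight norm:   ", np.linalg.norm(theta))
```

In the paper's constructions, loops of this form either converge to a vacuous (near-zero) value function when the regularization strength is large or diverge under common conditions; the placeholder chain above is not guaranteed to reproduce either behavior.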
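The Experiment Setup row quotes the neural-network variant (Example 4): 100 models trained with semi-gradient TD updates under a fixed learning rate, a two-layer network with 3 hidden neurons, and a target value function drawn pseudo-randomly from [-1, 1]. The sketch below shows one way a single such run could look; the activation, initialization, reward construction, off-policy sampling distribution, and regularization strength are assumptions (the paper's exact setup is in its Appendix C), and only one model is trained here rather than 100.

```python
# Hedged sketch of an Example 4 style run: a two-layer network with 3 hidden
# units trained by semi-gradient TD with weight decay on a 9-state chain whose
# target values are pseudo-random in [-1, 1]. Everything beyond those quoted
# details (tanh activation, init scales, reward construction, off-policy
# sampling, alpha, eta) is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, alpha, eta = 9, 0.9, 0.01, 1e-3      # assumed hyperparameters

P = np.eye(n_states, k=1)
P[-1, 0] = 1.0                                        # deterministic cycle (assumed)
v_target = rng.uniform(-1.0, 1.0, size=n_states)      # pseudo-random target values
rewards = v_target - gamma * (P @ v_target)           # rewards consistent with v_target
behavior = rng.dirichlet(np.ones(n_states))           # off-policy state distribution (assumed)

X = np.eye(n_states)                                  # one-hot state inputs
W1 = rng.normal(scale=0.5, size=(n_states, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1));        b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)          # hidden layer, 3 neurons
    return h, (h @ W2 + b2)[0]        # scalar value estimate

for step in range(50_000):
    s = rng.choice(n_states, p=behavior)
    s_next = rng.choice(n_states, p=P[s])
    h, v_s = forward(X[s])
    _, v_next = forward(X[s_next])
    delta = rewards[s] + gamma * v_next - v_s         # TD error; no gradient through v_next
    # Backpropagate the gradient of v(s) only (semi-gradient), then apply weight decay.
    dh = W2[:, 0] * (1.0 - h ** 2)
    W2 += alpha * (delta * h[:, None] - eta * W2)
    b2 += alpha * (delta - eta * b2)
    W1 += alpha * (delta * np.outer(X[s], dh) - eta * W1)
    b1 += alpha * (delta * dh - eta * b1)

v_learned = np.array([forward(X[s])[1] for s in range(n_states)])
print("target :", np.round(v_target, 2))
print("learned:", np.round(v_learned, 2))
```

A full reproduction would repeat this over 100 seeds and sweep the regularization strength; the quoted Example 4 reports vacuous models and small-η error across such runs, which this simplified single-run sketch is not guaranteed to exhibit.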