On Double Descent in Reinforcement Learning with LSTD and Random Features

Authors: David Brellmann, Eloïse Berthier, David Filliat, Goran Frehse

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a theoretical analysis of the influence of network size and l2-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define overparameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Squares Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. Correction terms vanish when the l2-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
Researcher Affiliation | Academia | David Brellmann, Eloïse Berthier, David Filliat & Goran Frehse; U2IS, ENSTA Paris, Institut Polytechnique de Paris, Palaiseau, France; {firstname.lastname}@ensta-paris.fr
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using the 'recursive regularized LSTD implementation of Dann et al. (2014)' but does not provide its own implementation.
Open Datasets | No | The paper states that 'D_train := {(s_i, r_i, s'_i)}_{i=1}^n is derived from a sample path of n transitions' and refers to a 'synthetic ergodic MRP', a 'gridworld MRP', and 'OpenAI Gym Taxi-v3 (Towers et al., 2023)'. While the environments are mentioned, it does not provide specific links, DOIs, repositories, or formal citations for a pre-existing, downloadable dataset of transitions for direct public access.
Dataset Splits | No | The paper describes the generation of training data ('D_train') from sample paths but does not specify any training/test/validation dataset splits (percentages or counts) or reference predefined splits needed for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using the 'recursive regularized LSTD implementation of Dann et al. (2014)' and refers to 'OpenAI Gym Taxi-v3 (Towers et al., 2023)' and the 'ReLU function' but does not specify software names with version numbers for libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Experimental Setup. We use the recursive regularized LSTD implementation of Dann et al. (2014) on three MRPs: a synthetic ergodic MRP (500 states); a gridworld MRP (400 states) obtained from a random policy in a 20×20 gridworld (Ahmed, 2018); and a Taxi-v3 MRP (356 states) obtained from a learned policy in OpenAI Gym Taxi-v3 (Towers et al., 2023) (Figure 1a). In all cases, states are described by d-dimensional Gaussian vectors with d = 50. For the random features, W is drawn from a Gaussian distribution and σ(·) = max(0, ·) is the ReLU function. For all experiments, D_train := {(s_i, r_i, s'_i)}_{i=1}^n is derived from a sample path of n transitions with the same seed (42). For each instance i, we sample random features using the seed i. The following graphs show averages over 30 instances. A minimal sketch of this setup is given below the table.
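To make the quoted setup concrete, here is a minimal sketch of regularized LSTD with ReLU random features and the empirical MSBE it is evaluated on, assuming the standard (Â + λI)θ = b̂ estimator; it is not the authors' implementation. The state dimension d = 50, the Gaussian feature matrix W, the ReLU nonlinearity, and the seed 42 follow the quoted setup, while the number of features N, the number of transitions n, the discount γ, the regularization λ, and the synthetic Gaussian transitions are illustrative placeholders.

```python
import numpy as np

def random_features(states, W):
    """ReLU random features sigma(W s) for a batch of states of shape (n, d)."""
    return np.maximum(states @ W.T, 0.0)                # shape (n, N)

def regularized_lstd(phi, phi_next, rewards, gamma, lam):
    """Solve (A_hat + lam * I) theta = b_hat for the LSTD weight vector."""
    n, N = phi.shape
    A_hat = phi.T @ (phi - gamma * phi_next) / n        # empirical LSTD matrix
    b_hat = phi.T @ rewards / n                         # empirical reward projection
    return np.linalg.solve(A_hat + lam * np.eye(N), b_hat)

def empirical_msbe(phi, phi_next, rewards, theta, gamma):
    """Empirical Mean-Squared Bellman Error on the visited transitions."""
    td_errors = rewards + gamma * phi_next @ theta - phi @ theta
    return np.mean(td_errors ** 2)

# --- illustrative usage on synthetic placeholder transitions (not the paper's MRPs) ---
rng = np.random.default_rng(42)                          # seed 42, as in the quoted setup
d, N, n, gamma, lam = 50, 200, 300, 0.95, 1e-3           # N, n, gamma, lam are placeholders
states      = rng.normal(size=(n, d))
next_states = rng.normal(size=(n, d))
rewards     = rng.normal(size=n)
W = rng.normal(size=(N, d)) / np.sqrt(d)                 # Gaussian random-feature matrix

phi, phi_next = random_features(states, W), random_features(next_states, W)
theta = regularized_lstd(phi, phi_next, rewards, gamma, lam)
print("empirical MSBE:", empirical_msbe(phi, phi_next, rewards, theta, gamma))
```

Sweeping N across the number of distinct visited states while keeping λ small would trace out the parameter/state ratio along which the paper reports the double descent; increasing λ should dampen it, consistent with the correction terms described in the abstract.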