Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces

Authors: Yaqi Duan, Martin J. Wainwright

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes. Our analysis also offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL. ... The Mountain Car problem, a benchmark continuous control task, illustrates the acceleration phenomenon and underlying stability. ... We employed fitted Q-iteration (FQI) with carefully selected linear basis functions to derive near-optimal policies with off-line data. This learning procedure exhibits a value sub-optimality decay at a rate of 1/n, a significant improvement over the classical rate of 1/√n, as detailed in Figure 1(b).
Researcher Affiliation | Academia | Yaqi Duan, Department of Technology, Operations, and Statistics, Stern School of Business, New York University, New York, NY 10012, yaqi.duan@stern.nyu.edu; Martin J. Wainwright, Laboratory for Information and Decision Systems, Statistics and Data Science Center, Department of Electrical Engineering and Computer Science, and Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, wainwrigwork@gmail.com
Pseudocode | No | The paper describes the fitted Q-iteration (FQI) process textually in Appendix D.2 but does not provide a formal pseudocode block or algorithm box.
Open Source Code | No | The paper does not explicitly state that its source code is available or provide a link to a repository in its main text or appendices. The NeurIPS checklist indicates supplementary code, but this is not stated within the paper's content itself.
Open Datasets | No | Our experiments were based on an off-line dataset consisting of n i.i.d. tuples D = {(s_i, a_i, r_i, s_i′)}_{i=1}^n ⊂ S × A × ℝ × S, where the state-action pairs {(s_i, a_i)}_{i=1}^n = {(p_i, v_i, f_i)}_{i=1}^n were generated from a uniform distribution over the cube [p_min, p_max] × [v_min, v_max] × [f_min, f_max]. (A hypothetical sampling sketch appears below the table.)
Dataset Splits | No | The paper describes how its dataset was generated and how policies were evaluated, but it does not specify explicit training/validation/test splits of the generated dataset D in the traditional supervised-learning sense; instead, it evaluates the learned policy on simulated trajectories.
Hardware Specification | Yes | The experiment ran for 3 days on two laptops, each equipped with an Apple M2 Pro CPU and 16 GB RAM.
Software Dependencies | No | The paper mentions using "fitted Q-iteration (FQI) with linear function approximation" and "ridge regression" but does not specify software dependencies with version numbers (e.g., Python version, specific library versions).
Experiment Setup | Yes | Linear function approximation: We approximate the optimal Q-function (s, a) ↦ Q*(s, a) using a d-dimensional linear function class with d = 3000 features. ... The FQI process begins by initializing the weight vector as w_0 := 0 ∈ ℝ^3000. ... w_{t+1} := argmin_{w ∈ ℝ^3000} { Σ_{i=1}^n (y_i − ⟨w, φ(s_i, a_i)⟩)^2 + Λ_n ‖w‖_2^2 }, (28) where Λ_n = 0.01√n in all experiments reported here. We terminate the procedure after at most 500 iterations, or when there have been 5 consecutive iterations with insignificant improvements in weights, where insignificant means that ‖w_{t+1} − w_t‖_2 / √3000 < 0.005. (A sketch of this update appears below the table.)
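
The uniform sampling quoted in the Open Datasets row can be illustrated with a short script. This is a minimal sketch, not the authors' code: the dynamics in mountain_car_step, the placeholder reward, and the ranges P_MIN/P_MAX, V_MIN/V_MAX, F_MIN/F_MAX are assumptions standing in for the paper's simulator and its unspecified cube [p_min, p_max] × [v_min, v_max] × [f_min, f_max].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sampling ranges for position, velocity, and force (not taken from the paper).
P_MIN, P_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07
F_MIN, F_MAX = -1.0, 1.0

def mountain_car_step(p, v, f):
    """Hypothetical one-step Mountain Car dynamics; a stand-in for the paper's simulator."""
    v_next = np.clip(v + 0.001 * f - 0.0025 * np.cos(3 * p), V_MIN, V_MAX)
    p_next = np.clip(p + v_next, P_MIN, P_MAX)
    reward = -1.0 if p_next < 0.5 else 0.0  # placeholder reward, not the paper's
    return reward, (p_next, v_next)

def generate_offline_dataset(n):
    """Draw n i.i.d. tuples (s_i, a_i, r_i, s_i') with (p_i, v_i, f_i) uniform on the cube."""
    data = []
    for _ in range(n):
        p = rng.uniform(P_MIN, P_MAX)
        v = rng.uniform(V_MIN, V_MAX)
        f = rng.uniform(F_MIN, F_MAX)
        r, s_next = mountain_car_step(p, v, f)
        data.append(((p, v), f, r, s_next))
    return data

dataset = generate_offline_dataset(n=10_000)
```

The resulting list of (s, a, r, s') tuples has the shape consumed by the FQI sketch below.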
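
The FQI loop in the Experiment Setup row, i.e. the ridge-regression update of Eq. (28) with the ‖w_{t+1} − w_t‖_2/√3000 stopping rule, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the feature map phi, the action grid actions, and the discount factor gamma are assumptions, and the construction of the d = 3000 basis functions from Appendix D.2 is not reproduced.

```python
import numpy as np

def fitted_q_iteration(dataset, phi, actions, gamma=0.99, d=3000,
                       max_iters=500, tol=0.005, patience=5):
    """FQI with ridge regression in the spirit of Eq. (28).

    dataset : list of (s, a, r, s_next) tuples
    phi     : feature map phi(s, a) -> np.ndarray of shape (d,)  [assumed, not from the paper]
    actions : finite grid of candidate actions approximating max_{a'} Q(s', a')  [assumed]
    """
    n = len(dataset)
    lam = 0.01 * np.sqrt(n)  # ridge penalty Lambda_n, assumed to be 0.01 * sqrt(n)

    # Features and the ridge solver are fixed across iterations; only the targets y change.
    Phi = np.stack([phi(s, a) for (s, a, _, _) in dataset])        # shape (n, d)
    gram_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d))

    w = np.zeros(d)  # w_0 := 0
    stall = 0
    for _ in range(max_iters):
        # Regression targets y_i = r_i + gamma * max_{a'} <w_t, phi(s_i', a')>.
        y = np.array([
            r + gamma * max(phi(s_next, a_prime) @ w for a_prime in actions)
            for (_, _, r, s_next) in dataset
        ])
        w_next = gram_inv @ (Phi.T @ y)

        # Stop after `patience` consecutive iterations with insignificant weight changes.
        if np.linalg.norm(w_next - w) / np.sqrt(d) < tol:
            stall += 1
        else:
            stall = 0
        w = w_next
        if stall >= patience:
            break
    return w
```

Given user-supplied phi and actions, a call such as w = fitted_q_iteration(dataset, phi, actions) returns the weight vector defining the greedy policy s ↦ argmax_a ⟨w, φ(s, a)⟩.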