Non-stationary Reinforcement Learning under General Function Approximation
Authors: Songtao Feng, Ming Yin, Ruiquan Huang, Yu-Xiang Wang, Jing Yang, Yingbin Liang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we make the first such an attempt. We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs, which subsumes majority of existing tractable RL problems in static MDPs as well as non-stationary MDPs. Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA, which features a sliding window mechanism and a new confidence set design for non-stationary MDPs. We then establish an upper bound on the dynamic regret for the proposed algorithm, and show that SW-OPEA is provably efficient as long as the variation budget is not significantly large. We further demonstrate via examples of non-stationary linear and tabular MDPs that our algorithm performs better in small variation budget scenario than the existing UCB-type algorithms. To the best of our knowledge, this is the first dynamic regret analysis in non-stationary MDPs with general function approximation. |
| Researcher Affiliation | Academia | 1The Ohio State Univsersity 2The University of California, Santa Barbara 3The Pennsylvania State University. |
| Pseudocode | Yes | Algorithm 1 SW-OPEA (Sliding Window Optimisticbased Exploration and Approximation under non-stationary MDPs) |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and analyzes algorithms based on abstract MDPs (e.g., non-stationary linear and tabular MDPs) rather than conducting empirical training on specific datasets. No dataset access information is provided. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments involving dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not mention any specific hardware used for computations or experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiment setups, including hyperparameters or system-level training settings. |