Non-stationary Reinforcement Learning under General Function Approximation

Authors: Songtao Feng, Ming Yin, Ruiquan Huang, Yu-Xiang Wang, Jing Yang, Yingbin Liang

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | In this paper, we make the first such attempt. We first propose a new complexity metric called the dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs, which subsumes the majority of existing tractable RL problems in static MDPs as well as non-stationary MDPs. Based on the proposed complexity metric, we propose a novel confidence-set-based model-free algorithm called SW-OPEA, which features a sliding window mechanism and a new confidence set design for non-stationary MDPs. We then establish an upper bound on the dynamic regret of the proposed algorithm and show that SW-OPEA is provably efficient as long as the variation budget is not significantly large. We further demonstrate via examples of non-stationary linear and tabular MDPs that our algorithm performs better in the small variation budget scenario than the existing UCB-type algorithms. To the best of our knowledge, this is the first dynamic regret analysis for non-stationary MDPs with general function approximation. |
| Researcher Affiliation | Academia | The Ohio State University; University of California, Santa Barbara; The Pennsylvania State University. |
| Pseudocode | Yes | Algorithm 1: SW-OPEA (Sliding Window Optimistic-based Exploration and Approximation under non-stationary MDPs). |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and analyzes algorithms on abstract MDPs (e.g., non-stationary linear and tabular MDPs) rather than training on specific datasets; no dataset access information is provided. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments involving dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not mention any specific hardware used for computations or experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiment setups, including hyperparameters or system-level training settings. |