Achieving Constant Regret in Linear Markov Decision Processes

Authors: Weitong Zhang, Zhiyuan Fan, Jiafan He, Quanquan Gu

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Theoretical We study constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinitely many episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs), where both the transition kernel and the reward function can be approximated by linear functions up to misspecification level ζ. At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant with respect to the number of episodes. We provide detailed proofs for all theorems in the appendix, starting from Appendix C. Our submission paper does not include experiments.
Researcher Affiliation Academia Weitong Zhang, School of Data Science and Society, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, weitongz@unc.edu; Zhiyuan Fan, EECS, Massachusetts Institute of Technology, Cambridge, MA 02139, fanzy@mit.edu; Jiafan He, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, jiafanhe19@ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, qgu@cs.ucla.edu
Pseudocode Yes Algorithm 1 Cert-LSVI-UCB; Algorithm 2 Cert-LinUCB(s; {\tilde{w}^k_{h,l}}_l, {\tilde{U}^{k,-1}_{h,l}}_l, L) -> \hat{V}^k_h(s), \pi^k_h(s), l^k_h(s), f^k_h(s)
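To make the regression step behind algorithms of this family concrete, here is a minimal sketch of a generic LSVI-UCB-style update: ridge regression on value targets plus an elliptical confidence bonus. This is an illustrative simplification, not the paper's certified estimator or its multi-phase scheme; the function name and signature are hypothetical.

```python
import numpy as np

def lsvi_ucb_step(Phi, targets, beta, lam=1.0):
    """One generic LSVI-UCB-style update (illustrative sketch only).

    Phi:     (n, d) array of feature vectors phi(s_i, a_i)
    targets: (n,) array of regression targets r_i + V_next(s'_i)
    beta:    exploration bonus multiplier
    Returns (w, Q), where Q(phi) is an optimistic value estimate.
    """
    d = Phi.shape[1]
    U = lam * np.eye(d) + Phi.T @ Phi        # regularized Gram matrix
    U_inv = np.linalg.inv(U)
    w = U_inv @ Phi.T @ targets              # ridge-regression weights

    def Q(phi):
        # Elliptical UCB bonus beta * sqrt(phi^T U^{-1} phi)
        bonus = beta * np.sqrt(phi @ U_inv @ phi)
        return phi @ w + bonus

    return w, Q
```

The paper's certified estimator refines exactly this kind of update with a multi-phase concentration analysis; the sketch shows only the base regression-plus-bonus structure shared by LSVI-UCB variants.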
Open Source Code No The paper does not provide an explicit statement about releasing source code or a link to a code repository.
Open Datasets No The paper uses "synthetic datasets" for numerical simulation, but it does not refer to a publicly available or open dataset that can be accessed via a link, DOI, or specific citation.
Dataset Splits No The paper does not explicitly provide dataset splits for training, validation, or testing, as it uses synthetic data generation methods rather than pre-defined public datasets.
Hardware Specification No The paper's "I Numerical Simulation" section describes the simulation setup but does not specify any hardware details like CPU, GPU models, or memory. It primarily describes the environment parameters and simulation methodology.
Software Dependencies No The paper's "I Numerical Simulation" section describes the simulation setup but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, specific libraries).
Experiment Setup Yes Specifically, we consider a linear MDP with S = 4, A = 5, H = 2, and d = 8. Each element of the feature vector ϕ(s, a) and of µ(s′) is generated from a uniform distribution U(0, 1). ... The reward is defined by r(s, a) = ϕ(s, a)⊤θ, where θ ∼ N(0, I_d). Model misspecification is also added to the transition kernel P and the reward function r. ... We investigated misspecification levels ζ = 0, 0.01, ..., 0.3 in 16 randomly generated environments over 2000 episodes.
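The simulation setup above can be sketched as follows. This generates a synthetic linear MDP with the stated sizes; the normalization of P, the clipping of r, and the form of the misspecification perturbation are assumptions of this sketch, since the paper's quoted setup elides those details.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, d, zeta = 4, 5, 2, 8, 0.01  # sizes from the described setup

# Feature map phi(s, a) and next-state measure mu(s'), entries ~ U(0, 1).
phi = rng.uniform(0.0, 1.0, size=(S, A, d))
mu = rng.uniform(0.0, 1.0, size=(S, d))

# Linear transition kernel P(s' | s, a) proportional to phi(s, a)^T mu(s'),
# normalized over s' (normalization is an assumption of this sketch).
P = np.einsum("sad,td->sat", phi, mu)
P /= P.sum(axis=2, keepdims=True)

# Linear reward r(s, a) = phi(s, a)^T theta with theta ~ N(0, I_d),
# clipped to [0, 1] for a well-defined MDP (assumption of this sketch).
theta = rng.standard_normal(d)
r = np.clip(phi @ theta, 0.0, 1.0)

# Add misspecification of level zeta to P and r (perturbation form assumed).
P = np.clip(P + rng.uniform(-zeta, zeta, size=P.shape), 1e-8, None)
P /= P.sum(axis=2, keepdims=True)
r = np.clip(r + rng.uniform(-zeta, zeta, size=r.shape), 0.0, 1.0)
```

Sweeping zeta over 0, 0.01, ..., 0.3 and regenerating the environment with different seeds would reproduce the shape of the described experiment grid.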