Dynamic Regret of Adversarial Linear Mixture MDPs
Authors: Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and an unknown transition kernel. We consider linear mixture MDPs, whose transition kernel is a linear mixture model, and choose dynamic regret as the performance measure. Denoting by $d$ the dimension of the feature mapping, $H$ the length of each episode, $K$ the number of episodes, and $P_T$ the non-stationarity measure, we propose a novel algorithm that enjoys an $\widetilde{O}\big(\sqrt{d^2 H^3 K} + \sqrt{H^4 (K + P_T)(1 + P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves the previously best-known dynamic regret for adversarial linear mixture MDPs and adversarial tabular MDPs. We also establish an $\Omega\big(\sqrt{d^2 H^3 K} + \sqrt{H K (H + P_T)}\big)$ lower bound, indicating that our algorithm is optimal in $K$ and $P_T$. |
| Researcher Affiliation | Academia | Long-Fei Li, Peng Zhao, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China {lilf, zhaop, zhouzh}@lamda.nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 POWERS-FixShare; Algorithm 2 POWERS-FixShare-OnE |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments on a specific dataset. Thus, there is no mention of a public dataset or its accessibility. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments with datasets, therefore it does not provide any training/validation/test splits. |
| Hardware Specification | No | The paper describes theoretical work and algorithm design; it does not report on empirical experiments, therefore no hardware specifications are mentioned. |
| Software Dependencies | No | The paper describes theoretical work and algorithm design; it does not report on empirical experiments requiring specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and analysis, not empirical experimentation. The section titled 'Problem Setup' describes the mathematical model, not a practical experimental configuration. No hyperparameters or system-level training settings are provided. |
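
The optimality claim in the Research Type row can be checked with a short calculation. The sketch below compares only the $P_T$-dependent terms of the upper and lower bounds quoted there. The conditions $1 \le H$ and $P_T \le K$ are simplifying assumptions made for this illustration, not conditions stated in the table.

```latex
% Assumptions (for illustration only): H >= 1 and P_T <= K.
\begin{align*}
\text{Upper term:}\quad
  \sqrt{H^4 (K + P_T)(1 + P_T)}
  &\le \sqrt{2 H^4 K (1 + P_T)}
  && \text{since } K + P_T \le 2K, \\
\text{Lower term:}\quad
  \sqrt{H K (H + P_T)}
  &\ge \sqrt{H K (1 + P_T)}
  && \text{since } H \ge 1, \\
\text{Ratio:}\quad
  \frac{\sqrt{2 H^4 K (1 + P_T)}}{\sqrt{H K (1 + P_T)}}
  &= \sqrt{2}\, H^{3/2}.
\end{align*}
```

Under these assumptions the two terms differ by a factor that depends on $H$ alone (plus the logarithmic factors hidden by $\widetilde{O}$), which is consistent with the bound being optimal in $K$ and $P_T$ but not necessarily in $H$.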