Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs
Authors: Kaixuan Ji, Qingyue Zhao, Jiafan He, Weitong Zhang, Quanquan Gu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an e O (d + log |S|) K + d2 regret with full-information feedback1, where d is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, K is the number of episodes, |S| is the cardinality of the state space. We also provide hardness results to justify the near optimality of our algorithm and the inevitability of log |S| in the regret bound. |
| Researcher Affiliation | Academia | Kaixuan Ji1 , Qingyue Zhao2 , Jiafan He1, Weitong Zhang1, Quanquan Gu1 1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science and Technology, Tsinghua University |
| Pseudocode | Yes | Algorithm 1 HF-O2PS Algorithm 2 Mirror Descent on Occupancy Measure Algorithm 3 High-order moment estimator (HOME) |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code or provide a link to a code repository. |
| Open Datasets | No | The paper is theoretical, focusing on algorithm design and regret analysis, and does not involve empirical training on datasets. Therefore, no information about publicly available datasets or access is provided. |
| Dataset Splits | No | The paper describes theoretical work (algorithm design and regret bounds) and does not present empirical experiments. Thus, there is no mention of dataset splits for validation. |
| Hardware Specification | No | The paper is theoretical and does not report on empirical experiments that would require specific hardware for execution. Therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | The paper presents theoretical research, including algorithm design and proofs, rather than empirical experiments. Consequently, it does not list any specific software dependencies with version numbers for reproducing experimental results. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiments. Therefore, there are no details regarding experimental setup, such as hyperparameters or training settings. |