Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Authors: Kaixuan Ji, Qingyue Zhao, Jiafan He, Weitong Zhang, Quanquan Gu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen rewards over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least squares estimator for the transition kernel, and (2) an occupancy-measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an Õ((d + log|S|)√K + d²) regret with full-information feedback, where d is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, K is the number of episodes, and |S| is the cardinality of the state space. We also provide hardness results to justify the near-optimality of our algorithm and the inevitability of log|S| in the regret bound.
Researcher Affiliation | Academia | Kaixuan Ji¹, Qingyue Zhao², Jiafan He¹, Weitong Zhang¹, Quanquan Gu¹. ¹Department of Computer Science, University of California, Los Angeles; ²Department of Computer Science and Technology, Tsinghua University.
Pseudocode | Yes | Algorithm 1: HF-O2PS; Algorithm 2: Mirror Descent on Occupancy Measure; Algorithm 3: High-Order Moment Estimator (HOME).
Open Source Code | No | The paper does not contain any statement about releasing open-source code or provide a link to a code repository.
Open Datasets | No | The paper is theoretical, focusing on algorithm design and regret analysis, and does not involve empirical training on datasets; no information about publicly available datasets is provided.
Dataset Splits | No | The paper describes theoretical work (algorithm design and regret bounds) and does not present empirical experiments, so there is no mention of dataset splits for validation.
Hardware Specification | No | The paper is theoretical and does not report empirical experiments that would require specific hardware; no hardware specifications are mentioned.
Software Dependencies | No | The paper presents theoretical research (algorithm design and proofs) rather than empirical experiments, so it does not list software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe empirical experiments; there are no details on experimental setup such as hyperparameters or training settings.
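Since the paper provides pseudocode but no released code, the flavor of its first component, the variance-uncertainty-aware weighted least squares estimator for the transition parameter, can be illustrated with a minimal sketch. All names here (`estimate_theta`, `lam`, `alpha`) and the exact weighting rule are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def estimate_theta(phis, ys, var_ests, lam=1.0, alpha=1e-2):
    """Variance-weighted ridge regression sketch (illustrative, not HF-O2PS).

    Solves theta_hat = Sigma^{-1} b with
        Sigma = lam * I + sum_k phi_k phi_k^T / sigma_bar_k^2,
        b     = sum_k y_k * phi_k   / sigma_bar_k^2,
    where sigma_bar_k^2 floors each sample's estimated variance at alpha,
    so high-variance (uncertain) samples receive smaller weight.
    """
    d = phis.shape[1]
    Sigma = lam * np.eye(d)
    b = np.zeros(d)
    for phi, y, var in zip(phis, ys, var_ests):
        sigma_bar_sq = max(var, alpha)  # uncertainty-aware weight floor
        Sigma += np.outer(phi, phi) / sigma_bar_sq
        b += y * phi / sigma_bar_sq
    return np.linalg.solve(Sigma, b)
```

With noiseless targets and a small ridge term, the estimator recovers the underlying parameter, which is the behavior the paper's confidence-set analysis builds on.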
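The second component, mirror descent on occupancy measures (Algorithm 2), can likewise be sketched in its simplest form: an entropic (exponentiated-gradient) update on a probability simplex. The actual algorithm works over the occupancy-measure polytope with flow constraints; this simplex-only version and the name `mirror_descent_step` are simplifying assumptions for illustration:

```python
import numpy as np

def mirror_descent_step(q, reward, eta=0.1):
    """One entropic mirror descent step on the simplex (illustrative).

    Multiplicative-weights update q' proportional to q * exp(eta * reward),
    i.e., KL-regularized mirror descent; the true occupancy-measure version
    would additionally project onto the MDP's flow-constraint polytope.
    """
    w = q * np.exp(eta * reward)
    return w / w.sum()
```

Iterating this update concentrates the distribution on high-reward coordinates while always remaining a valid probability vector, which is the mechanism that lets the policy adapt to adversarially chosen rewards.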