reproducibilityindex.ai

Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Authors: Kaixuan Ji, Qingyue Zhao, Jiafan He, Weitong Zhang, Quanquan Gu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen reward over episodes, our algorithm employs (1) a variance-uncertainty-aware weighted least square estimator for the transition kernel; and (2) an occupancy measure-based technique for the online search of a stochastic policy. We show that our algorithm achieves an e O (d + log \|S\|) K + d2 regret with full-information feedback1, where d is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, K is the number of episodes, \|S\| is the cardinality of the state space. We also provide hardness results to justify the near optimality of our algorithm and the inevitability of log \|S\| in the regret bound.
Researcher Affiliation	Academia	Kaixuan Ji1 , Qingyue Zhao2 , Jiafan He1, Weitong Zhang1, Quanquan Gu1 1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science and Technology, Tsinghua University
Pseudocode	Yes	Algorithm 1 HF-O2PS Algorithm 2 Mirror Descent on Occupancy Measure Algorithm 3 High-order moment estimator (HOME)
Open Source Code	No	The paper does not contain any statement about releasing open-source code or provide a link to a code repository.
Open Datasets	No	The paper is theoretical, focusing on algorithm design and regret analysis, and does not involve empirical training on datasets. Therefore, no information about publicly available datasets or access is provided.
Dataset Splits	No	The paper describes theoretical work (algorithm design and regret bounds) and does not present empirical experiments. Thus, there is no mention of dataset splits for validation.
Hardware Specification	No	The paper is theoretical and does not report on empirical experiments that would require specific hardware for execution. Therefore, no hardware specifications are mentioned.
Software Dependencies	No	The paper presents theoretical research, including algorithm design and proofs, rather than empirical experiments. Consequently, it does not list any specific software dependencies with version numbers for reproducing experimental results.
Experiment Setup	No	The paper is theoretical and does not describe empirical experiments. Therefore, there are no details regarding experimental setup, such as hyperparameters or training settings.