Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Authors: Jianliang He, Han Zhong, Zhuoran Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure, the average-reward generalized eluder coefficient (AGEC), which captures the challenge of exploration in AMDPs with general function approximation. Using AGEC, we prove that LOOP achieves a sublinear Õ(poly(d, sp(V*)) √(Tβ)) regret, where d and β correspond to AGEC and the log-covering number of the hypothesis class respectively, sp(V*) is the span of the optimal state bias function, T denotes the number of steps, and Õ(·) omits logarithmic factors. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs. (The reported bound is restated in display form after this table.)
Researcher Affiliation | Academia | Jianliang He, Department of Statistics and Data Science, Fudan University (hejl@fudan.edu.cn); Han Zhong, Center for Data Science, Peking University (hanzhong@stu.pku.edu.cn); Zhuoran Yang, Department of Statistics and Data Science, Yale University (zhuoran.yang@yale.edu)
Pseudocode | Yes | Algorithm 1: Local-fitted Optimization with Optimism, LOOP(H, G, T, δ); Algorithm 2: MLE-based Local-fitted Optimization with Optimism, MLE-LOOP(H, T, δ); Algorithm 3: Extended Value Iteration (EVI). (A hedged sketch of the LOOP template appears after this table.)
Open Source Code | No | The paper does not contain any statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | This is a theoretical paper and does not conduct experiments on a specific dataset; therefore, there is no mention of a publicly available dataset.
Dataset Splits | No | This is a theoretical paper and does not conduct experiments; therefore, no training/validation/test dataset splits are mentioned.
Hardware Specification | No | This is a theoretical paper that does not report computational experiments; therefore, no hardware specifications are provided.
Software Dependencies | No | This is a theoretical paper that does not report computational experiments; therefore, no software dependencies with version numbers are listed.
Experiment Setup | No | This is a theoretical paper and does not conduct experiments; therefore, no experimental setup details such as hyperparameters or training settings are provided.
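For concreteness, the guarantee quoted in the Research Type row can be written in display form. The abstract states only the bound itself; the left-hand definition of average-reward regret, the optimal average reward J* minus the rewards actually collected, is the standard notion for AMDPs and is our assumption about the quantity being bounded.

```latex
% Average-reward regret (standard definition, assumed here) and the bound from the abstract.
\mathrm{Reg}(T) \;:=\; \sum_{t=1}^{T} \bigl( J^{*} - r(s_t, a_t) \bigr),
\qquad
\mathrm{Reg}(T) \;\le\; \widetilde{O}\Bigl( \mathrm{poly}\bigl(d,\, \mathrm{sp}(V^{*})\bigr) \sqrt{T\beta} \Bigr).
```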
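The Pseudocode row names three algorithms but cannot convey their structure. Below is a minimal Python sketch of the generic LOOP template (Algorithm 1), written from the abstract's description alone: the discrepancy function `ell`, the optimistic planning oracle `optimistic_planner`, and the doubling-style switching test are illustrative stand-ins, not the paper's exact constructions.

```python
# Hypothetical sketch of the generic LOOP template (Algorithm 1).
# `ell`, `optimistic_planner`, and the switching test are illustrative
# stand-ins for the paper's discrepancy function, optimistic planning
# oracle, and low-switching criterion -- not its exact constructions.

def loop_sketch(hypothesis_class, T, beta, ell, optimistic_planner, env, s0):
    """Local-fitted Optimization with OPtimism: optimistic planning
    over a data-driven confidence set, with lazy policy switching."""
    history = []          # past transitions (s, a, r, s')
    anchor = 0            # index of the first sample after the last switch
    f_hat, policy = optimistic_planner(hypothesis_class)
    s = s0
    for _ in range(T):
        a = policy(s)
        r, s_next = env.step(s, a)   # env.step is assumed to return (reward, next_state)
        history.append((s, a, r, s_next))
        # Low-switching rule (assumed doubling-style test): re-plan only when
        # the discrepancy accumulated since the last switch exceeds beta.
        if sum(ell(f_hat, z) for z in history[anchor:]) > beta:
            # Confidence set: hypotheses whose total fitted discrepancy is
            # within beta of the best fit on all past data (illustrative form).
            best = min(sum(ell(g, z) for z in history) for g in hypothesis_class)
            confident = [f for f in hypothesis_class
                         if sum(ell(f, z) for z in history) <= best + beta]
            # Optimism: select the hypothesis/policy pair with the largest
            # optimal average reward inside the confidence set.
            f_hat, policy = optimistic_planner(confident)
            anchor = len(history)
        s = s_next
    return history
```

Presumably, the model-based incarnation would implement the planner via Extended Value Iteration (Algorithm 3), while MLE-LOOP (Algorithm 2) would build its confidence set from maximum-likelihood estimates rather than the generic discrepancy above.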