Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation
Authors: Jianliang He, Han Zhong, Zhuoran Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure, the average-reward generalized eluder coefficient (AGEC), which captures the challenge of exploration in AMDPs with general function approximation. Using AGEC, we prove that LOOP achieves a sublinear Õ(poly(d, sp(V*)) √(Tβ)) regret, where d and β correspond to the AGEC and the log-covering number of the hypothesis class respectively, sp(V*) is the span of the optimal state bias function, T denotes the number of steps, and Õ(·) omits logarithmic factors. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs. |
| Researcher Affiliation | Academia | Jianliang He, Department of Statistics and Data Science, Fudan University (hejl@fudan.edu.cn); Han Zhong, Center for Data Science, Peking University (hanzhong@stu.pku.edu.cn); Zhuoran Yang, Department of Statistics and Data Science, Yale University (zhuoran.yang@yale.edu) |
| Pseudocode | Yes | Algorithm 1: Local-fitted Optimization with Optimism, LOOP(H, G, T, δ); Algorithm 2: MLE-based Local-fitted Optimization with Optimism, MLE-LOOP(H, T, δ); Algorithm 3: Extended Value Iteration (EVI) |
| Open Source Code | No | The paper does not contain any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | This is a theoretical paper and does not conduct experiments on a specific dataset. Therefore, there is no mention of a publicly available dataset. |
| Dataset Splits | No | This is a theoretical paper and does not conduct experiments. Therefore, there are no training/test/validation dataset splits mentioned. |
| Hardware Specification | No | This is a theoretical paper that does not report computational experiments. Therefore, no hardware specifications are provided. |
| Software Dependencies | No | This is a theoretical paper that does not report computational experiments. Therefore, no software dependencies with version numbers are listed. |
| Experiment Setup | No | This is a theoretical paper and does not conduct experiments. Therefore, no experimental setup details like hyperparameters or training settings are provided. |
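The abstract highlights a low-switching policy updating scheme, i.e. the learner re-solves the optimistic planning problem only rarely, so the number of policy switches grows sublinearly in T. As a minimal illustration (hypothetical: the function name `low_switching_updates` is invented, and the paper's actual trigger is tied to its confidence-set construction rather than the simple doubling rule used here), a doubling-trigger schedule yields only O(log T) switches:

```python
import math

def low_switching_updates(T):
    """Illustrative low-switching schedule (not the paper's exact rule):
    update the policy only when the number of elapsed steps has doubled
    since the last update. Returns the steps at which updates occur."""
    updates = [1]
    last_update = 1
    for t in range(2, T + 1):
        if t >= 2 * last_update:  # doubling trigger
            updates.append(t)
            last_update = t
    return updates

switches = low_switching_updates(10_000)
# Only ~log2(T) policy switches over 10,000 steps.
print(len(switches))  # → 14
```

The point of such schemes is that each policy switch can incur a transient cost when analyzing average-reward regret, so keeping the switch count logarithmic in T is what allows the Õ(poly(d, sp(V*)) √(Tβ)) bound to remain sublinear.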