Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. [...] We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2 \log_2(K))$ batches with confidence parameter $\delta$. [...] Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
Researcher Affiliation | Academia | Department of Automation, Tsinghua University (zihan-zh17@mails.tsinghua.edu.cn); Department of Automation, Tsinghua University (jiangyh19@mails.tsinghua.edu.cn); Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University (yuan-zhou@tsinghua.edu.cn); Department of Automation, Tsinghua University (xyji@tsinghua.edu.cn)
Pseudocode | Yes | Algorithm 1: Main Algorithm; Algorithm 2: Raw Exploration; Algorithm 3: Policy Elimination
Open Source Code | No | The paper does not contain any statements about releasing source code, a link to a code repository, or information about code in supplementary materials.
Open Datasets | No | The paper is theoretical and does not involve empirical training on specific datasets.
Dataset Splits | No | As the paper is theoretical and does not involve empirical experiments, it does not mention validation dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe the specific hardware used for any experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations.
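
To give a feel for the two quantities quoted in the Research Type entry, the sketch below evaluates the leading terms of the regret bound, sqrt(S A H^3 K ln(1/δ)), and of the batch bound, H + log2(log2(K)), for concrete inputs. It is an illustration only: the function names are hypothetical, the constants and polylogarithmic factors hidden by the asymptotic notation are dropped (an assumption, so the numbers are scales rather than guarantees), and none of this code comes from the paper.

```python
import math


def regret_scale(S: int, A: int, H: int, K: int, delta: float) -> float:
    """Leading term sqrt(S * A * H^3 * K * ln(1/delta)) of the regret bound.

    Hidden constants and polylog factors are omitted (assumption).
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(S * A * H**3 * K * math.log(1.0 / delta))


def batch_count_scale(H: int, K: int) -> float:
    """Leading term H + log2(log2(K)) of the batch bound.

    The hidden constant is omitted (assumption); K >= 2 keeps log2(log2(K)) defined.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    return H + math.log2(math.log2(K))


if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    print(batch_count_scale(H=10, K=2**16))  # 10 + log2(16) = 14.0
    print(regret_scale(S=5, A=3, H=10, K=1000, delta=0.1))
```

The doubly logarithmic term grows very slowly: going from K = 2^16 to K = 2^32 episodes adds only one unit to the batch scale, while the regret scale grows with the square root of K.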