Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. [...] We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2\log_2(K))$ batches with confidence parameter $\delta$. [...] Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model. |
| Researcher Affiliation | Academia | Department of Automation, Tsinghua University, zihan-zh17@mails.tsinghua.edu.cn; Department of Automation, Tsinghua University, jiangyh19@mails.tsinghua.edu.cn; Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, yuan-zhou@tsinghua.edu.cn; Department of Automation, Tsinghua University, xyji@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: Main Algorithm; Algorithm 2: Raw Exploration; Algorithm 3: Policy Elimination |
| Open Source Code | No | The paper does not contain any statements about releasing source code, a link to a code repository, or information about code in supplementary materials. |
| Open Datasets | No | The paper is theoretical and does not involve empirical training on specific datasets. |
| Dataset Splits | No | As the paper is theoretical and does not involve empirical experiments, it does not mention validation dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe the specific hardware used for any experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations. |