reproducibilityindex.ai

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

NeurIPS 2022

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. [...] We design a computational efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2\log_2(K))$ batches with confidence parameter $\delta$. [...] Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model.
Researcher Affiliation	Academia	Department of Automation, Tsinghua University, zihan-zh17@mails.tsinghua.edu.cn; Department of Automation, Tsinghua University, jiangyh19@mails.tsinghua.edu.cn; Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, yuan-zhou@tsinghua.edu.cn; Department of Automation, Tsinghua University, xyji@tsinghua.edu.cn
Pseudocode	Yes	Algorithm 1 Main Algorithm, Algorithm 2 Raw Exploration, Algorithm 3 Policy Elimination
Open Source Code	No	The paper does not contain any statements about releasing source code, a link to a code repository, or information about code in supplementary materials.
Open Datasets	No	The paper is theoretical and does not involve empirical training on specific datasets.
Dataset Splits	No	As the paper is theoretical and does not involve empirical experiments, it does not mention validation dataset splits.
Hardware Specification	No	The paper is theoretical and does not describe the specific hardware used for any experiments.
Software Dependencies	No	The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup	No	The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations.