Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. [...] We design a computational efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in K episodes using $O(H+\log_2\log_2(K))$ batches with confidence parameter δ. [...] Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model.
Researcher Affiliation | Academia | Department of Automation, Tsinghua University, EMAIL; Department of Automation, Tsinghua University, EMAIL; Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, EMAIL; Department of Automation, Tsinghua University, EMAIL
Pseudocode | Yes | Algorithm 1 Main Algorithm, Algorithm 2 Raw Exploration, Algorithm 3 Policy Elimination
Open Source Code | No | The paper does not contain any statements about releasing source code, a link to a code repository, or information about code in supplementary materials.
Open Datasets | No | The paper is theoretical and does not involve empirical training on specific datasets.
Dataset Splits | No | As the paper is theoretical and does not involve empirical experiments, it does not mention validation dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe the specific hardware used for any experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations.