Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
Authors: Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. [...] We design a computational efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ in $K$ episodes using $O(H + \log_2 \log_2(K))$ batches with confidence parameter $\delta$. [...] Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model. |
| Researcher Affiliation | Academia | Department of Automation, Tsinghua University, EMAIL; Department of Automation, Tsinghua University, EMAIL; Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, EMAIL; Department of Automation, Tsinghua University, EMAIL |
| Pseudocode | Yes | Algorithm 1: Main Algorithm; Algorithm 2: Raw Exploration; Algorithm 3: Policy Elimination |
| Open Source Code | No | The paper does not contain any statements about releasing source code, a link to a code repository, or information about code in supplementary materials. |
| Open Datasets | No | The paper is theoretical and does not involve empirical training on specific datasets. |
| Dataset Splits | No | As the paper is theoretical and does not involve empirical experiments, it does not mention validation dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe the specific hardware used for any experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations. |