Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Authors: Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. We present the first algorithms that achieve near-optimal √(K + D) regret, where K is the number of episodes and D = ∑_{k=1}^{K} d_k is the total delay, significantly improving upon the best known regret bound of (K + D)^{2/3}. In Section 3 we devise an inefficient Hedge [13] based algorithm... In Section 4 we consider a relatively standard algorithm we call Delayed UOB-FTRL... In Section 5 we propose our final solution which is mainly algorithmic: we introduce the algorithm Delayed UOB-REPS...
Researcher Affiliation | Collaboration | Tiancheng Jin (University of Southern California, tiancheng.jin@usc.edu); Tal Lancewicki (Tel Aviv University, lancewicki@mail.tau.ac.il); Haipeng Luo (University of Southern California, haipengl@usc.edu); Yishay Mansour (Tel Aviv University and Google Research, mansour.yishay@gmail.com); Aviv Rosenberg (Amazon, avivros@amazon.com)
Pseudocode | Yes | Algorithm 1: Delayed Hedge; Algorithm 2: Delayed UOB-FTRL; Algorithm 3: Delayed UOB-REPS with Delay-adapted Estimator (an illustrative delayed-Hedge sketch appears after this table)
Open Source Code | No | The paper does not contain any statements about releasing source code, nor does it provide any links to a code repository.
Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret bounds; it does not utilize datasets for empirical evaluation, so there is no mention of publicly available or open datasets.
Dataset Splits | No | As a theoretical paper, it does not involve empirical evaluation with datasets and therefore does not specify training, validation, or test splits.
Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis; it does not describe any hardware used to run experiments.
Software Dependencies | No | The paper does not list software dependencies or version numbers needed to replicate an experimental setup, as it is a theoretical paper.
Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup, hyperparameters, or system-level training settings.
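Since the paper provides pseudocode but no released implementation, the following is a minimal, illustrative sketch of a Hedge-style learner that handles delayed bandit feedback, in the spirit of the Delayed Hedge idea referenced in Section 3. It is not the paper's Algorithm 1 (which operates over policies/occupancy measures in an episodic MDP): the class name DelayedHedge, the learning rate eta, the toy loss sequence, and the delay model below are all assumptions made for this example.

```python
import numpy as np

# Illustrative sketch only (not the paper's Algorithm 1): a multiplicative-weights
# (Hedge) learner over a finite set of arms, where the bandit loss of episode k is
# revealed only after a delay d_k. All names and constants here are assumptions.

class DelayedHedge:
    def __init__(self, n_arms, eta):
        self.eta = eta                    # learning rate (assumed constant)
        self.log_w = np.zeros(n_arms)     # log-weights over arms

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())   # stable softmax of log-weights
        return w / w.sum()

    def act(self, rng):
        p = self.probs()
        arm = rng.choice(len(p), p=p)
        return arm, p[arm]                # remember the probability at play time

    def update(self, arm, loss, prob_at_play):
        # Importance-weighted loss estimate, applied only when the delayed
        # feedback arrives; it uses the sampling probability from the episode
        # in which the arm was actually played.
        est = loss / prob_at_play
        self.log_w[arm] -= self.eta * est


# Toy loop: feedback for episode k arrives d_k episodes later.
rng = np.random.default_rng(0)
K = 200
delays = rng.integers(0, 5, size=K)       # assumed delay sequence d_k
learner = DelayedHedge(n_arms=3, eta=0.05)
pending = []                              # (arrival_time, arm, loss, prob_at_play)
for k in range(K):
    arm, p = learner.act(rng)
    loss = rng.uniform(0.0, 1.0)          # stand-in for an adversarial loss
    pending.append((k + int(delays[k]), arm, loss, p))
    arrived = [x for x in pending if x[0] <= k]
    pending = [x for x in pending if x[0] > k]
    for _, a, l, q in arrived:
        learner.update(a, l, q)
```

The point the sketch is meant to illustrate is simply that each update is deferred until its delayed feedback arrives and is built from the probability recorded at play time; the paper's actual algorithms add the MDP structure, optimistic transition estimates, and the delay-adapted estimator on top of this basic pattern.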