Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Authors: Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. We present the first algorithms that achieve near-optimal √(K + D) regret, where K is the number of episodes and D = ∑_{k=1}^{K} d_k is the total delay, significantly improving upon the best known regret bound of (K + D)^{2/3}. In Section 3 we devise an inefficient Hedge [13] based algorithm... In Section 4 we consider a relatively standard algorithm we call Delayed UOB-FTRL... In Section 5 we propose our final solution which is mainly algorithmic: we introduce the algorithm Delayed UOB-REPS...
Researcher Affiliation | Collaboration | Tiancheng Jin (University of Southern California, tiancheng.jin@usc.edu); Tal Lancewicki (Tel Aviv University, lancewicki@mail.tau.ac.il); Haipeng Luo (University of Southern California, haipengl@usc.edu); Yishay Mansour (Tel Aviv University and Google Research, mansour.yishay@gmail.com); Aviv Rosenberg (Amazon, avivros@amazon.com)
Pseudocode | Yes | Algorithm 1: Delayed Hedge; Algorithm 2: Delayed UOB-FTRL; Algorithm 3: Delayed UOB-REPS with Delay-adapted Estimator (an illustrative delayed-Hedge sketch appears after this table)
Open Source Code | No | The paper does not contain any statements about releasing source code, nor does it provide any links to a code repository.
Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret bounds; it does not utilize datasets for empirical evaluation, so there is no mention of publicly available or open datasets.
Dataset Splits | No | As a theoretical paper, it does not involve empirical evaluation with datasets and therefore does not specify training, validation, or test splits.
Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis; it does not describe any hardware used to run experiments.
Software Dependencies | No | The paper does not list software dependencies or version numbers needed to replicate an experimental setup, as it is a theoretical paper.
Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup, hyperparameters, or system-level training settings.
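Since the paper provides pseudocode but no released implementation, the following is a minimal, illustrative sketch of a Hedge-style learner that handles delayed bandit feedback, in the spirit of the Delayed Hedge idea referenced in Section 3. It is not the paper's Algorithm 1 (which operates over policies/occupancy measures in an episodic MDP): the class name DelayedHedge, the learning rate eta, the toy loss sequence, and the delay model below are all assumptions made for this example.

```python
import numpy as np

# Illustrative sketch only (not the paper's Algorithm 1): a multiplicative-weights
# (Hedge) learner over a finite set of arms, where the bandit loss of episode k is
# revealed only after a delay d_k. All names and constants here are assumptions.

class DelayedHedge:
    def __init__(self, n_arms, eta):
        self.eta = eta                    # learning rate (assumed constant)
        self.log_w = np.zeros(n_arms)     # log-weights over arms

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())   # stable softmax of log-weights
        return w / w.sum()

    def act(self, rng):
        p = self.probs()
        arm = rng.choice(len(p), p=p)
        return arm, p[arm]                # remember the probability at play time

    def update(self, arm, loss, prob_at_play):
        # Importance-weighted loss estimate, applied only when the delayed
        # feedback arrives; it uses the sampling probability from the episode
        # in which the arm was actually played.
        est = loss / prob_at_play
        self.log_w[arm] -= self.eta * est


# Toy loop: feedback for episode k arrives d_k episodes later.
rng = np.random.default_rng(0)
K = 200
delays = rng.integers(0, 5, size=K)       # assumed delay sequence d_k
learner = DelayedHedge(n_arms=3, eta=0.05)
pending = []                              # (arrival_time, arm, loss, prob_at_play)
for k in range(K):
    arm, p = learner.act(rng)
    loss = rng.uniform(0.0, 1.0)          # stand-in for an adversarial loss
    pending.append((k + int(delays[k]), arm, loss, p))
    arrived = [x for x in pending if x[0] <= k]
    pending = [x for x in pending if x[0] > k]
    for _, a, l, q in arrived:
        learner.update(a, l, q)
```

The point the sketch is meant to illustrate is simply that each update is deferred until its delayed feedback arrives and is built from the probability recorded at play time; the paper's actual algorithms add the MDP structure, optimistic transition estimates, and the delay-adapted estimator on top of this basic pattern.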