Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Authors: Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. We present the first algorithms that achieve near-optimal K + D regret, where K is the number of episodes and D = PK k=1 dk is the total delay, significantly improving upon the best known regret bound of (K + D)2/3. In Section 3 we devise an inefficient Hedge [13] based algorithm... In Section 4 we consider a relatively standard algorithm we call Delayed UOB-FTRL... In Section 5 we propose our final solution which is mainly algorithmic: we introduce the algorithm Delayed UOB-REPS... |
| Researcher Affiliation | Collaboration | Tiancheng Jin University of Southern California tiancheng.jin@usc.edu Tal Lancewicki Tel Aviv University lancewicki@mail.tau.ac.il Haipeng Luo University of Southern California haipengl@usc.edu Yishay Mansour Tel Aviv University and Google Research mansour.yishay@gmail.com Aviv Rosenberg Amazon avivros@amazon.com |
| Pseudocode | Yes | Algorithm 1 Delayed Hedge; Algorithm 2 Delayed UOB-FTRL; Algorithm 3 Delayed UOB-REPS with Delay-adapted Estimator |
| Open Source Code | No | The paper does not contain any statements about releasing source code, nor does it provide any links to a code repository. |
| Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret bounds; it does not utilize datasets for empirical evaluation, so there is no mention of publicly available or open datasets. |
| Dataset Splits | No | As a theoretical paper, it does not involve empirical evaluation with datasets, and therefore does not specify training, validation, or test data splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis; it does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers needed to replicate any experimental setup, as it is a theoretical paper. |
| Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, including hyperparameters or system-level training settings. |