Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Authors: Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. We present the first algorithms that achieve near-optimal √(K + D) regret, where K is the number of episodes and D = Σ_{k=1}^{K} d_k is the total delay, significantly improving upon the best known regret bound of (K + D)^{2/3}. In Section 3 we devise an inefficient Hedge [13] based algorithm... In Section 4 we consider a relatively standard algorithm we call Delayed UOB-FTRL... In Section 5 we propose our final solution, which is mainly algorithmic: we introduce the algorithm Delayed UOB-REPS... |
| Researcher Affiliation | Collaboration | Tiancheng Jin, University of Southern California; Tal Lancewicki, Tel Aviv University; Haipeng Luo, University of Southern California; Yishay Mansour, Tel Aviv University and Google Research; Aviv Rosenberg, Amazon |
| Pseudocode | Yes | Algorithm 1 Delayed Hedge; Algorithm 2 Delayed UOB-FTRL; Algorithm 3 Delayed UOB-REPS with Delay-adapted Estimator |
| Open Source Code | No | The paper does not contain any statements about releasing source code, nor does it provide any links to a code repository. |
| Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret bounds; it does not utilize datasets for empirical evaluation, so there is no mention of publicly available or open datasets. |
| Dataset Splits | No | As a theoretical paper, it does not involve empirical evaluation with datasets, and therefore does not specify training, validation, or test data splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis; it does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers needed to replicate any experimental setup, as it is a theoretical paper. |
| Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, including hyperparameters or system-level training settings. |
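Since the paper provides no code, the following is a minimal illustrative sketch of the exponential-weights (Hedge) update under delayed feedback, the core idea behind the paper's Delayed Hedge (Algorithm 1). It is a simplified full-information, non-MDP variant written for this report; the function name, delay model, and learning rate are assumptions, not the authors' implementation. In each round the learner samples from the exponential-weights distribution, but the loss vector for round t only becomes available d_t rounds later, at which point it is folded into the weights.

```python
import math
import random

def delayed_hedge(loss_fn, n_arms, horizon, delays, lr):
    """Illustrative Hedge with delayed feedback (simplified, full-information).

    loss_fn(t) returns the loss vector for round t; feedback for round t
    arrives only at round t + delays[t]. Returns the learner's total loss.
    """
    log_weights = [0.0] * n_arms
    pending = []  # (arrival_round, loss_vector) pairs awaiting delivery
    total_loss = 0.0
    for t in range(horizon):
        # Sample an arm from the exponential-weights distribution
        # (subtract the max log-weight for numerical stability).
        mx = max(log_weights)
        probs = [math.exp(w - mx) for w in log_weights]
        z = sum(probs)
        probs = [p / z for p in probs]
        arm = random.choices(range(n_arms), weights=probs)[0]

        losses = loss_fn(t)
        total_loss += losses[arm]
        pending.append((t + delays[t], losses))

        # Apply every loss vector whose delay has elapsed by round t.
        arrived = [l for (a, l) in pending if a <= t]
        pending = [(a, l) for (a, l) in pending if a > t]
        for l in arrived:
            for i in range(n_arms):
                log_weights[i] -= lr * l[i]
    return total_loss
```

With a constant delay d, the update at round t reflects losses only up to round t - d, which is exactly why the delay term D = Σ d_k enters the regret bound: the learner pays for acting on stale information.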