Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Authors: Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. We present the first algorithms that achieve near-optimal K + D regret, where K is the number of episodes and D = PK k=1 dk is the total delay, significantly improving upon the best known regret bound of (K + D)2/3. In Section 3 we devise an inefficient Hedge [13] based algorithm... In Section 4 we consider a relatively standard algorithm we call Delayed UOB-FTRL... In Section 5 we propose our final solution which is mainly algorithmic: we introduce the algorithm Delayed UOB-REPS...
Researcher Affiliation Collaboration Tiancheng Jin University of Southern California EMAIL Tal Lancewicki Tel Aviv University EMAIL Haipeng Luo University of Southern California EMAIL Yishay Mansour Tel Aviv University and Google Research EMAIL Aviv Rosenberg Amazon EMAIL
Pseudocode Yes Algorithm 1 Delayed Hedge; Algorithm 2 Delayed UOB-FTRL; Algorithm 3 Delayed UOB-REPS with Delay-adapted Estimator
Open Source Code No The paper does not contain any statements about releasing source code, nor does it provide any links to a code repository.
Open Datasets No The paper is a theoretical work focusing on algorithm design and regret bounds; it does not utilize datasets for empirical evaluation, so there is no mention of publicly available or open datasets.
Dataset Splits No As a theoretical paper, it does not involve empirical evaluation with datasets, and therefore does not specify training, validation, or test data splits.
Hardware Specification No The paper is theoretical and focuses on algorithm design and analysis; it does not describe any specific hardware used for running experiments.
Software Dependencies No The paper does not provide specific software dependencies or version numbers needed to replicate any experimental setup, as it is a theoretical paper.
Experiment Setup No The paper is theoretical and does not describe any empirical experimental setup, including hyperparameters or system-level training settings.