Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Authors: Haolin Liu, Zak Mhammedi, Chen-Yu Wei, Julian Zimmert

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either fullinformation loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transition (Foster et al., 2022). First, we improve the poly(d,A,H)T 5/6 regret bound of Zhao et al. (2024) to poly(d,A,H)T 2/3 for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d,A,H)T 2/3 regret, though they are computationally inefficient. We also propose oracle-efficient modelfree algorithms with poly(d,A,H)T 4/5 regret.
Researcher Affiliation Collaboration Haolin Liu University of Virginia srs8rh@virginia.edu Zakaria Mhammedi Google Research mhammedi@google.com Chen-Yu Wei University of Virginia chenyu.wei@virginia.edu Julian Zimmert Google Research zimmert@google.com
Pseudocode Yes Algorithm 1 Model-Based Algorithm for Full-Information Feedback
Open Source Code No The paper does not contain any explicit statements or links indicating the provision of open-source code for the described methodology.
Open Datasets No The paper is theoretical and does not involve empirical studies with data, therefore, there is no mention of training datasets or their public availability.
Dataset Splits No The paper is theoretical and does not involve empirical studies with data, therefore, there is no mention of validation datasets or their splits.
Hardware Specification No As a theoretical paper, it does not describe experimental setups or require specific hardware specifications for reproduction.
Software Dependencies No As a theoretical paper, it does not detail software dependencies or specific version numbers required for experimental replication.
Experiment Setup No As a theoretical paper, it focuses on mathematical analysis and algorithm design rather than empirical evaluation, and thus does not provide details on experimental setup or hyperparameters.