Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback
Authors: Haolin Liu, Zak Mhammedi, Chen-Yu Wei, Julian Zimmert
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either fullinformation loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transition (Foster et al., 2022). First, we improve the poly(d,A,H)T 5/6 regret bound of Zhao et al. (2024) to poly(d,A,H)T 2/3 for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d,A,H)T 2/3 regret, though they are computationally inefficient. We also propose oracle-efficient modelfree algorithms with poly(d,A,H)T 4/5 regret. |
| Researcher Affiliation | Collaboration | Haolin Liu University of Virginia srs8rh@virginia.edu Zakaria Mhammedi Google Research mhammedi@google.com Chen-Yu Wei University of Virginia chenyu.wei@virginia.edu Julian Zimmert Google Research zimmert@google.com |
| Pseudocode | Yes | Algorithm 1 Model-Based Algorithm for Full-Information Feedback |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the provision of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not involve empirical studies with data, therefore, there is no mention of training datasets or their public availability. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical studies with data, therefore, there is no mention of validation datasets or their splits. |
| Hardware Specification | No | As a theoretical paper, it does not describe experimental setups or require specific hardware specifications for reproduction. |
| Software Dependencies | No | As a theoretical paper, it does not detail software dependencies or specific version numbers required for experimental replication. |
| Experiment Setup | No | As a theoretical paper, it focuses on mathematical analysis and algorithm design rather than empirical evaluation, and thus does not provide details on experimental setup or hyperparameters. |