Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Authors: Haolin Liu, Zak Mhammedi, Chen-Yu Wei, Julian Zimmert

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transitions (Foster et al., 2022). First, we improve the poly(d, A, H)T^{5/6} regret bound of Zhao et al. (2024) to poly(d, A, H)T^{2/3} for the full-information unknown-transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study of the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d, A, H)T^{2/3} regret, though they are computationally inefficient. We also propose oracle-efficient model-free algorithms with poly(d, A, H)T^{4/5} regret.
Researcher Affiliation | Collaboration | Haolin Liu, University of Virginia (EMAIL); Zakaria Mhammedi, Google Research (EMAIL); Chen-Yu Wei, University of Virginia (EMAIL); Julian Zimmert, Google Research (EMAIL)
Pseudocode | Yes | Algorithm 1: Model-Based Algorithm for Full-Information Feedback
Open Source Code | No | The paper does not contain any explicit statements or links indicating the provision of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not involve empirical studies with data; therefore, there is no mention of training datasets or their public availability.
Dataset Splits | No | The paper is theoretical and does not involve empirical studies with data; therefore, there is no mention of validation datasets or their splits.
Hardware Specification | No | As a theoretical paper, it does not describe experimental setups or require specific hardware specifications for reproduction.
Software Dependencies | No | As a theoretical paper, it does not detail software dependencies or specific version numbers required for experimental replication.
Experiment Setup | No | As a theoretical paper, it focuses on mathematical analysis and algorithm design rather than empirical evaluation, and thus does not provide details on experimental setup or hyperparameters.