Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Authors: Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with a Q-functions ensemble, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble. [...] Our first algorithm, called Randomized Ensemble Least Squares Value Iteration (RE-LSVI), is an optimistic LSVI algorithm where the optimism is achieved via a new randomization technique. This algorithm is simple and gives our tightest regret bound. Our second algorithm, called Randomized Ensemble Policy Optimization (REPO), is based on policy optimization (PO) methods that have become extremely popular in recent years (Schulman et al., 2015; 2017) and especially in training LLMs (Ouyang et al., 2022). This is the first PO algorithm for aggregate feedback, which does not exist even in the tabular MDP setting. Theoretically, ABF is substantially more challenging under function approximation since it can no longer be formulated as a linear bandits problem (Efroni et al., 2021) (doing so would lead to a regret bound that scales with the number of states). (An illustrative sketch of the ensemble-randomization idea appears after this table.)
Researcher Affiliation | Collaboration | Asaf Cassel (Blavatnik School of Computer Science, Tel Aviv University; work partially done while at Amazon Science), Haipeng Luo (University of Southern California; Amazon Science), Aviv Rosenberg (Google Research; work partially done while at Amazon Science), Dmitry Sotnikov (Amazon Science). Correspondence to: Asaf Cassel <acassel@mail.tau.ac.il>.
Pseudocode | Yes | Algorithm 2: RE-LSVI with aggregate feedback; Algorithm 3: REPO with aggregate feedback; Algorithm 4: REPO with aggregate feedback for tabular MDPs.
Open Source Code | No | The paper does not contain any statements about releasing open-source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | No | This is a theoretical paper focusing on algorithm development and regret analysis for Linear MDPs. It does not conduct empirical experiments with datasets, and therefore does not specify public or open datasets for training.
Dataset Splits | No | This is a theoretical paper, and no empirical experiments involving datasets or their splits (training, validation, test) are described.
Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis, not empirical experiments. Therefore, it does not specify any hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and describes algorithms and proofs. It does not mention specific software dependencies with version numbers required to replicate any empirical work.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and analysis. It does not describe an experimental setup with specific hyperparameters or system-level training settings.
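
To give a concrete feel for the two ingredients quoted in the Research Type row, the following is a minimal, illustrative Python sketch and not the paper's RE-LSVI (Algorithm 2). It assumes a simplified setting where each past episode reveals only its aggregate return, fits a linear reward parameter by ridge regression on summed per-step features, and induces optimism by maximizing over an ensemble of randomly perturbed Q-estimates. The function names (`fit_reward_from_aggregate`, `ensemble_q_values`), the noise shape, the ensemble size, and the toy data are assumptions made for illustration, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_reward_from_aggregate(phi_sums, returns, lam=1.0):
    """Ridge-regression estimate of a linear reward parameter.

    Aggregate bandit feedback: each past episode reveals only its total
    return, modeled here as <theta, sum of per-step features> plus noise.
    Returns the estimate and the inverse regularized design matrix.
    """
    d = phi_sums.shape[1]
    gram = phi_sums.T @ phi_sums + lam * np.eye(d)  # regularized design matrix
    return np.linalg.solve(gram, phi_sums.T @ returns), np.linalg.inv(gram)

def ensemble_q_values(phi_sa, theta_hat, gram_inv, n_ensemble=10, noise_scale=1.0):
    """Optimism via a randomized ensemble (illustrative tuning, not the paper's).

    Each ensemble member perturbs theta_hat with Gaussian noise shaped by the
    inverse design matrix; taking the maximum over members yields an
    optimistic linear Q-estimate for each candidate feature vector.
    """
    members = []
    for _ in range(n_ensemble):
        xi = rng.multivariate_normal(np.zeros_like(theta_hat), noise_scale * gram_inv)
        members.append(phi_sa @ (theta_hat + xi))  # one linear Q-estimate per member
    return np.max(members, axis=0)

# Toy usage: 50 past episodes, feature dimension 4, 3 candidate actions.
phi_sums = rng.normal(size=(50, 4))                          # summed features per episode
returns = phi_sums @ np.ones(4) + 0.1 * rng.normal(size=50)  # observed aggregate returns
theta_hat, gram_inv = fit_reward_from_aggregate(phi_sums, returns)
phi_actions = rng.normal(size=(3, 4))                        # features of candidate (s, a) pairs
print(ensemble_q_values(phi_actions, theta_hat, gram_inv))   # optimistic scores per action
```

The actual RE-LSVI and REPO algorithms handle the full linear-MDP structure and calibrate the randomization to obtain the regret guarantees stated in the paper; the snippet above only conveys the mechanism of ensemble-based optimism under aggregate feedback.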