Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback
Authors: Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with a Q-function ensemble, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble. [...] Our first algorithm, called Randomized Ensemble Least Squares Value Iteration (RE-LSVI), is an optimistic LSVI algorithm where the optimism is achieved via a new randomization technique. This algorithm is simple and gives our tightest regret bound. Our second algorithm, called Randomized Ensemble Policy Optimization (REPO), is based on policy optimization (PO) methods that have become extremely popular in recent years (Schulman et al., 2015; 2017) and especially in training LLMs (Ouyang et al., 2022). This is the first PO algorithm for aggregate feedback; no such algorithm exists even in the tabular MDP setting. Theoretically, ABF is substantially more challenging under function approximation since it can no longer be formulated as a linear bandits problem (Efroni et al., 2021) (doing so would lead to a regret bound that scales with the number of states). Illustrative sketches of the ensemble randomization and hedging steps appear after this table. |
| Researcher Affiliation | Collaboration | Asaf Cassel (Blavatnik School of Computer Science, Tel Aviv University; work partially done while at Amazon Science), Haipeng Luo (University of Southern California; Amazon Science), Aviv Rosenberg (Google Research; work partially done while at Amazon Science), Dmitry Sotnikov (Amazon Science). Correspondence to: Asaf Cassel <acassel@mail.tau.ac.il>. |
| Pseudocode | Yes | Algorithm 2 RE-LSVI with aggregate feedback; Algorithm 3 REPO with aggregate feedback; Algorithm 4 REPO with aggregate feedback for tabular MDPs. |
| Open Source Code | No | The paper does not contain any statements about releasing open-source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | No | This is a theoretical paper focusing on algorithm development and regret analysis for Linear MDPs. It does not conduct empirical experiments with datasets, and therefore does not specify public or open datasets for training. |
| Dataset Splits | No | This is a theoretical paper, and no empirical experiments involving datasets or their splits (training, validation, test) are described. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis, not empirical experiments. Therefore, it does not specify any hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and describes algorithms and proofs. It does not mention specific software dependencies with version numbers required to replicate any empirical work. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and analysis. It does not describe an experimental setup with specific hyperparameters or system-level training settings. |
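For readers who want a concrete picture of the randomized-ensemble optimism device, here is a minimal Python sketch. It is an assumption-laden illustration, not the paper's Algorithm 2: the targets `y`, the Gaussian perturbation scale `sigma`, and the helper names (`perturbed_ridge_ensemble`, `optimistic_q`) are hypothetical stand-ins, and RE-LSVI's exact backup targets and noise calibration are specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_ridge_ensemble(Phi, y, m=10, lam=1.0, sigma=0.1):
    """Fit an ensemble of m perturbed ridge-regression solutions.

    Phi: (n, d) matrix of state-action features from past episodes.
    y:   (n,) regression targets (in RE-LSVI these come from
         aggregate-return backups; here they are generic).
    Each member solves ridge regression against an independently
    noise-perturbed copy of the targets -- the generic randomized
    ensemble trick; the paper's exact construction differs.
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)  # regularized Gram matrix
    thetas = [
        np.linalg.solve(A, Phi.T @ (y + sigma * rng.standard_normal(len(y))))
        for _ in range(m)
    ]
    return np.stack(thetas)  # (m, d)

def optimistic_q(thetas, phi_sa):
    """Optimistic Q-estimate: max over the ensemble of linear Q-functions."""
    return np.max(thetas @ phi_sa)

# Hypothetical usage on synthetic data (d = 4 features, n = 50 samples).
Phi = rng.standard_normal((50, 4))
y = Phi @ np.array([1.0, -0.5, 0.3, 0.0]) + 0.1 * rng.standard_normal(50)
thetas = perturbed_ridge_ensemble(Phi, y)
print(optimistic_q(thetas, Phi[0]))
```

Acting greedily with respect to the ensemble maximum plays the role of an optimistic bonus without computing explicit confidence widths.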
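REPO's "hedging scheme over the ensemble" can likewise be pictured as an exponential-weights update. Again a hedged sketch under stated assumptions: the per-member losses and the learning rate `eta` are placeholders, and the actual losses and the way the hedge interacts with the softmax policy update are given by the paper's Algorithm 3.

```python
import numpy as np

def hedge_update(weights, losses, eta=0.1):
    """One exponential-weights (Hedge) step over ensemble members.

    weights: current distribution over the Q-function ensemble.
    losses:  per-member losses for the current episode (placeholder
             here; REPO defines these from aggregate feedback).
    """
    w = weights * np.exp(-eta * np.asarray(losses))
    return w / w.sum()

# Hypothetical usage: uniform prior over a 10-member ensemble.
w = np.ones(10) / 10
w = hedge_update(w, np.linspace(0.0, 1.0, 10))
print(w)  # mass shifts toward low-loss ensemble members
```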