Leveraging Offline Data in Online Reinforcement Learning
Authors: Andrew Wagenmaker, Aldo Pacchiano
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We characterize the necessary number of online samples needed in this setting given access to some offline dataset, and develop an algorithm, FTPEDEL, which is provably optimal, up to H factors. In addition to introducing the Fine Tune RL setting, we make the following contributions: ... 2. We show there exists an algorithm, FTPEDEL, which, up to lower-order terms, only collects, for each step $h$, $\min_{T_{\mathrm{on}}} T_{\mathrm{on}}$ s.t. $C^h_{\mathrm{o2o}}(D_{\mathrm{off}}, \epsilon, T_{\mathrm{on}}) \le 1$ online episodes (the minimal number of online episodes which ensures the offline-to-online concentrability coefficient is sufficiently small) and returns a policy that is ϵ-optimal. Furthermore, we show that this complexity is necessary: no algorithm can collect fewer online samples and return a policy guaranteed to be ϵ-optimal. |
| Researcher Affiliation | Collaboration | ¹University of Washington, Seattle. ²Work done while at Microsoft Research, New York. Current Affiliation: Broad Institute of MIT and Harvard and Boston University. |
| Pseudocode | Yes | Algorithm 1: Fine-Tuning Policy Learning via Experiment Design in Linear MDPs (FTPEDEL, informal); Algorithm 2: Fine-Tuning Policy Learning via Experiment Design in Linear MDPs (FTPEDEL); Algorithm 3: Online Frank-Wolfe via Regret Minimization (FWREGRET); Algorithm 4: Collect Optimal Covariates (OPTCOV) |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of its source code. |
| Open Datasets | No | This paper is theoretical and does not involve experiments with a specific dataset. It refers to 'offline data' in a conceptual manner as part of its problem definition rather than as a concrete dataset used for training. |
| Dataset Splits | No | This paper is theoretical and does not involve experiments or dataset splits for validation. |
| Hardware Specification | No | This paper is theoretical and does not describe any experimental setup or specific hardware used. |
| Software Dependencies | No | This paper is theoretical and does not describe any experimental setup or specific software dependencies with version numbers. |
| Experiment Setup | No | This paper is theoretical and does not describe any experimental setup, hyperparameters, or training settings. |