Leveraging Offline Data in Online Reinforcement Learning

Authors: Andrew Wagenmaker, Aldo Pacchiano

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We characterize the number of online samples necessary in this setting given access to some offline dataset, and develop an algorithm, FTPEDEL, which is provably optimal up to H factors. In addition to introducing the FineTuneRL setting, we make the following contributions: ... 2. We show there exists an algorithm, FTPEDEL, which, up to lower-order terms, only collects, for each step h, min_{T_on} T_on s.t. C_h^{o2o}(D_off, ϵ, T_on) ≤ 1 online episodes (the minimal number of online episodes which ensures the offline-to-online concentrability coefficient is sufficiently small) and returns a policy that is ϵ-optimal. Furthermore, we show that this complexity is necessary: no algorithm can collect fewer online samples and return a policy guaranteed to be ϵ-optimal.
Researcher Affiliation | Collaboration | 1 University of Washington, Seattle. 2 Work done while at Microsoft Research, New York. Current affiliation: Broad Institute of MIT and Harvard, and Boston University.
Pseudocode | Yes | Algorithm 1: Fine-Tuning Policy Learning via Experiment Design in Linear MDPs (FTPEDEL, informal); Algorithm 2: Fine-Tuning Policy Learning via Experiment Design in Linear MDPs (FTPEDEL); Algorithm 3: Online Frank-Wolfe via Regret Minimization (FWREGRET); Algorithm 4: Collect Optimal Covariates (OPTCOV)
Open Source Code | No | The paper does not provide any statement or link regarding the availability of its source code.
Open Datasets | No | This paper is theoretical and does not involve experiments with a specific dataset. It refers to "offline data" conceptually, as part of its problem definition, rather than as a concrete dataset used for training.
Dataset Splits | No | This paper is theoretical and does not involve experiments or dataset splits for validation.
Hardware Specification | No | This paper is theoretical and does not describe any experimental setup or specific hardware used.
Software Dependencies | No | This paper is theoretical and does not describe any experimental setup or specific software dependencies with version numbers.
Experiment Setup | No | This paper is theoretical and does not describe any experimental setup, hyperparameters, or training settings.
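The sample complexity quoted in the "Research Type" row is a minimal-episode-count optimization: the smallest number of online episodes T_on for which the offline-to-online concentrability coefficient drops to at most 1. As an illustrative sketch (not code from the paper), and assuming the coefficient is non-increasing in T_on, that search can be written as a binary search over a hypothetical stand-in function for C_h^{o2o}(D_off, ϵ, T_on):

```python
# Illustrative sketch only: `concentrability` is a hypothetical stand-in
# for the paper's offline-to-online coefficient C_h^{o2o}(D_off, eps, T_on),
# assumed non-increasing in the number of online episodes T_on.

def minimal_online_episodes(concentrability, t_max=10**6):
    """Return the smallest T_on in [0, t_max] with concentrability(T_on) <= 1.

    Binary search is valid under the (assumed) monotonicity of the
    coefficient in T_on.
    """
    if concentrability(t_max) > 1:
        raise ValueError("coefficient does not reach 1 within t_max episodes")
    lo, hi = 0, t_max
    while lo < hi:
        mid = (lo + hi) // 2
        if concentrability(mid) <= 1:
            hi = mid  # constraint already satisfied at mid
        else:
            lo = mid + 1  # need more online episodes
    return lo

# Toy example: a coefficient that decays as 100 / (T_on + 1);
# the constraint 100 / (T_on + 1) <= 1 first holds at T_on = 99.
print(minimal_online_episodes(lambda t: 100 / (t + 1)))  # → 99
```

The binary search only mirrors the structure of the min-T_on expression; in the paper the coefficient depends on the offline dataset D_off and accuracy ϵ, and the bound is established analytically rather than computed.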