Demonstration-Regularized RL

Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using N^E expert demonstrations enables the identification of an optimal policy at a sample complexity of order Õ(Poly(S, A, H)/(ε² N^E)) in finite and Õ(Poly(d, H)/(ε² N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of actions, S the number of states in the finite case, and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. (See the behavior-cloning sketch after the table.)
Researcher Affiliation | Collaboration | Daniil Tiapkin (1,2), Denis Belomestny (3,2), Daniele Calandriello (4), Eric Moulines (1,5), Rémi Munos (4), Alexey Naumov (2), Pierre Perrault (6), Michal Valko (4), Pierre Ménard (7). Affiliations: 1 CMAP, École Polytechnique; 2 HSE University; 3 Duisburg-Essen University; 4 Google DeepMind; 5 Mohamed Bin Zayed University of AI, UAE; 6 IDEMIA; 7 ENS Lyon.
Pseudocode | Yes | The paper provides pseudocode for four algorithms: Algorithm 1 (Demonstration-regularized RL), Algorithm 2 (Demonstration-regularized RLHF), Algorithm 3 (UCBVI-Ent+), and Algorithm 4 (LSVI-UCB-Ent). (A sketch of the KL-regularized backup these algorithms build on appears after the table.)
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or mention code in supplementary materials.
Open Datasets | No | This is a theoretical paper that does not conduct empirical experiments using datasets in the traditional train/validation/test-split sense. While it mentions 'expert demonstrations' as input for the theoretical analysis, it does not refer to them as publicly available datasets with access information for empirical training.
Dataset Splits | No | This is a theoretical paper that does not conduct empirical experiments and therefore does not describe training, validation, or test dataset splits.
Hardware Specification | No | This is a theoretical paper that does not involve empirical experiments, so no hardware specifications are mentioned.
Software Dependencies | No | This is a theoretical paper that describes algorithms but does not specify software dependencies with version numbers for any empirical implementation.
Experiment Setup | No | This is a theoretical paper that presents algorithms and theoretical guarantees. It does not include an 'Experimental Setup' section or specify hyperparameters, training configurations, or system-level settings for empirical runs.
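
The two-stage pipeline described in the abstract first fits a behavior-cloning policy to the N^E expert demonstrations and then runs RL regularized by the KL divergence to that policy. Below is a minimal sketch of the first stage for a tabular MDP; the function name, trajectory format, and smoothing constant alpha are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def behavior_cloning_tabular(demos, n_states, n_actions, alpha=1.0):
    """Estimate a behavior-cloning policy from expert trajectories.

    demos: list of trajectories, each an iterable of (state, action) pairs.
    alpha: Dirichlet smoothing pseudo-count (hypothetical choice) that keeps
           every action probability positive, so the KL term used in the
           second stage stays finite.
    Returns pi_bc with pi_bc[s, a] = smoothed empirical frequency of a in s.
    """
    counts = np.full((n_states, n_actions), alpha)
    for traj in demos:
        for s, a in traj:
            counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)
```

More demonstrations sharpen this estimate, which is consistent with the abstract's sample complexity improving as 1/N^E.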
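For intuition about the second stage, the sketch below performs finite-horizon value iteration with a KL penalty toward the behavior-cloning policy, i.e. it maximizes E_pi[Q] - lam * KL(pi || pi_bc) at each step. It assumes a known tabular model, whereas Algorithms 3 and 4 (UCBVI-Ent+ and LSVI-UCB-Ent) handle unknown dynamics with exploration bonuses; the function name and the temperature lam are hypothetical.

```python
import numpy as np

def kl_regularized_backup(P, r, pi_bc, horizon, lam):
    """Finite-horizon value iteration regularized by KL(pi || pi_bc).

    P:   transition tensor P[s, a, s'] (assumed known for this sketch).
    r:   reward matrix r[s, a], reused at every step for simplicity.
    pi_bc: behavior-cloning reference policy, pi_bc[s, a] > 0.
    lam: regularization temperature; larger lam pulls pi toward pi_bc.
    Returns a list of per-step policies pi[h][s, a].
    """
    S, A = r.shape
    V = np.zeros(S)
    policies = [None] * horizon
    for h in reversed(range(horizon)):
        Q = r + P @ V                     # Q[s, a] = r(s, a) + E_{s'}[V(s')]
        m = Q.max(axis=1, keepdims=True)  # shift for numerical stability
        # Closed-form optimum: pi(a|s) proportional to pi_bc(a|s)*exp(Q(s,a)/lam)
        w = pi_bc * np.exp((Q - m) / lam)
        policies[h] = w / w.sum(axis=1, keepdims=True)
        # Soft value: V(s) = lam * log sum_a pi_bc(a|s) * exp(Q(s,a)/lam)
        V = lam * np.log(w.sum(axis=1)) + m[:, 0]
    return policies
```

As lam → 0 this reduces to standard value iteration; as lam grows, the learned policy stays close to the demonstrations.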