Demonstration-Regularized RL
Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using N^E expert demonstrations enables the identification of an optimal policy at a sample complexity of order Õ(Poly(S, A, H)/(ε² N^E)) in finite and Õ(Poly(d, H)/(ε² N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of actions, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. |
| Researcher Affiliation | Collaboration | Daniil Tiapkin1,2; Denis Belomestny3,2; Daniele Calandriello4; Eric Moulines1,5; Rémi Munos4; Alexey Naumov2; Pierre Perrault6; Michal Valko4; Pierre Ménard7. 1 CMAP, École Polytechnique; 2 HSE University; 3 Duisburg-Essen University; 4 Google DeepMind; 5 Mohamed Bin Zayed University of AI, UAE; 6 IDEMIA; 7 ENS Lyon |
| Pseudocode | Yes | Algorithm 1: Demonstration-regularized RL; Algorithm 2: Demonstration-regularized RLHF; Algorithm 3: UCBVI-Ent+; Algorithm 4: LSVI-UCB-Ent (an illustrative sketch of the two-stage recipe in Algorithm 1 follows the table) |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or mention code in supplementary materials. |
| Open Datasets | No | This is a theoretical paper that does not conduct empirical experiments using datasets in the traditional train/validation/test split sense. While it mentions 'expert demonstrations' as input for theoretical analysis, it does not refer to them as publicly available datasets with access information for empirical training. |
| Dataset Splits | No | This is a theoretical paper that does not conduct empirical experiments. Therefore, it does not describe training, validation, or test dataset splits. |
| Hardware Specification | No | This is a theoretical paper that does not involve empirical experiments, and therefore no hardware specifications are mentioned for running experiments. |
| Software Dependencies | No | This is a theoretical paper that describes algorithms but does not specify software dependencies with version numbers for any empirical implementation. |
| Experiment Setup | No | This is a theoretical paper that presents algorithms and theoretical guarantees. It does not include an 'Experimental Setup' section or specify hyperparameters, training configurations, or system-level settings for empirical runs. |
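
To make the two-stage recipe described in the abstract concrete, below is a minimal tabular sketch in Python: first estimate a behavior-cloning policy from the N^E expert demonstrations, then plan with a KL penalty toward that policy. This is not the authors' implementation: it assumes a known transition model and reward, so the exploration machinery of UCBVI-Ent+ and LSVI-UCB-Ent is omitted, and the function names `behavior_cloning` and `kl_regularized_value_iteration` are hypothetical.

```python
import numpy as np

def behavior_cloning(demos, n_states, n_actions, smoothing=1.0):
    """Estimate a behavior-cloning policy from expert (state, action) pairs.

    `smoothing` acts as a Laplace prior so every action keeps positive
    probability, which keeps the KL term in the planning step finite.
    """
    counts = np.full((n_states, n_actions), smoothing)
    for s, a in demos:
        counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # pi_bc[s, a]

def kl_regularized_value_iteration(P, r, pi_bc, horizon, lam):
    """Finite-horizon planning with a KL penalty toward the BC policy.

    Each backup uses the KL-regularized (soft) Bellman operator: the greedy
    policy is pi_bc * exp(Q / lam), renormalized per state, and the value is
    the corresponding log-sum-exp of Q weighted by pi_bc.
    P: transitions, shape (S, A, S); r: rewards, shape (S, A); lam > 0.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    policies = []
    for _ in range(horizon):
        Q = r + P @ V                                   # stage action-values, (S, A)
        logits = np.log(pi_bc) + Q / lam
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # soft backup: V(s) = lam * log sum_a pi_bc(a|s) * exp(Q(s,a) / lam)
        m = Q.max(axis=1, keepdims=True)
        V = lam * np.log((pi_bc * np.exp((Q - m) / lam)).sum(axis=1)) + m.squeeze(1)
        policies.append(pi)
    return list(reversed(policies))                     # policies indexed by stage h = 1..H
```

In this sketch, `lam` plays the role of the KL-regularization strength: large values keep the learned policy close to the behavior-cloning policy, while small values recover (approximately) unregularized value iteration.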