Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Authors: Weitong Zhang, Dongruo Zhou, Quanquan Gu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). We propose a new provably efficient algorithm, called UCRL-RFE, under the linear mixture MDP assumption... We show that to obtain an ϵ-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most Õ(H^5 d^2 ϵ^{-2}) episodes during the exploration phase. Here, H is the length of the episode and d is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most Õ(H^4 d(H + d) ϵ^{-2}) episodes to achieve an ϵ-optimal policy. By constructing a special class of linear mixture MDPs, we also prove that any reward-free algorithm needs to sample at least Ω̃(H^2 d ϵ^{-2}) episodes to obtain an ϵ-optimal policy. Our upper bound matches the lower bound in terms of the dependence on ϵ, and in terms of the dependence on d if H ≥ d. ... '3. If you ran experiments...' [N/A]
Researcher Affiliation | Academia | Weitong Zhang, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, weightzero@cs.ucla.edu; Dongruo Zhou, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, drzhou@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, qgu@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: UCRL-RFE Planning Module (PLAN); Algorithm 2: UCRL-RFE (Hoeffding Bonus); Algorithm 3: UCRL-RFE+ (Bernstein Bonus). A minimal sketch of the Hoeffding-type bonus appears after this table.
Open Source Code | No | The paper does not state that it provides open-source code for the described methodology, nor does it link to a code repository. The reproducibility checklist states: '3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]'.
Open Datasets | No | The paper is theoretical and does not conduct empirical experiments on datasets; it therefore does not mention public dataset availability or use specific datasets for training.
Dataset Splits | No | The paper is theoretical and does not conduct empirical experiments; it therefore does not describe training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical and does not report experimental hardware. The reproducibility checklist marks experiments and associated details as N/A.
Software Dependencies | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not list software dependencies or version numbers needed to reproduce experiments.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup, such as hyperparameters or system-level training settings.
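
To make the pseudocode entries above more concrete, the following is a minimal numpy sketch of the two primitives the Hoeffding-bonus variant of UCRL-RFE is built around: a rank-one ridge-regression update for the unknown mixture parameter θ* and the optimism bonus β‖φ_V(s, a)‖_{Σ^{-1}}. This is a hedged reconstruction for illustration only, not the authors' code (the paper releases none); the names phi_V, beta, and lam, the toy dimension, and the synthetic usage at the end are assumptions made for this example.

import numpy as np

# Illustrative sketch (not the authors' released code) of the Hoeffding-style
# bonus used by UCRL-RFE-type algorithms for linear mixture MDPs.
# The names phi_V, beta, and lam are assumptions made for this example.

d = 8         # feature dimension (assumed for the toy example)
lam = 1.0     # ridge regularization parameter (assumed)
beta = 1.0    # confidence radius; in the paper it depends on H, d and log terms (assumed constant here)

Sigma = lam * np.eye(d)   # regularized covariance of value-targeted features
b = np.zeros(d)           # running sum of feature * regression target

def update(phi_V, target):
    """Rank-one ridge-regression update after one observed transition.

    phi_V  : d-dim value-targeted feature, phi_V(s, a) = sum_{s'} phi(s'|s, a) * V(s')
    target : realized regression target, e.g. V(next state)
    """
    global Sigma, b
    Sigma += np.outer(phi_V, phi_V)
    b += phi_V * target

def theta_hat():
    """Ridge estimate of the unknown mixture parameter theta*."""
    return np.linalg.solve(Sigma, b)

def hoeffding_bonus(phi_V):
    """Optimism bonus beta * ||phi_V||_{Sigma^{-1}} added to the estimated value."""
    return beta * np.sqrt(phi_V @ np.linalg.solve(Sigma, phi_V))

# Toy usage: one synthetic update followed by a bonus query.
rng = np.random.default_rng(0)
phi = rng.normal(size=d)
update(phi, target=rng.normal())
print(theta_hat(), hoeffding_bonus(phi))

Roughly speaking, during the reward-free exploration phase this uncertainty term itself acts as the pseudo-reward that drives exploration, and the Bernstein-bonus variant (UCRL-RFE+) replaces the fixed radius beta with a variance-aware one, which is what tightens the sample complexity from Õ(H^5 d^2 ϵ^{-2}) to Õ(H^4 d(H + d) ϵ^{-2}).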