Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Authors: Weitong Zhang, Dongruo Zhou, Quanquan Gu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). We propose a new provably efficient algorithm, called UCRL-RFE, under the linear mixture MDP assumption... We show that to obtain an ϵ-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most Õ(H^5 d^2 ϵ^{-2}) episodes during the exploration phase. Here, H is the length of the episode and d is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most Õ(H^4 d(H + d) ϵ^{-2}) episodes to achieve an ϵ-optimal policy. By constructing a special class of linear mixture MDPs, we also prove that any reward-free algorithm needs to sample at least Ω̃(H^2 d ϵ^{-2}) episodes to obtain an ϵ-optimal policy. Our upper bound matches the lower bound in terms of the dependence on ϵ, and in terms of the dependence on d if H ≥ d. ... '3. If you ran experiments...' [N/A]
Researcher Affiliation | Academia | Weitong Zhang, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, weightzero@cs.ucla.edu; Dongruo Zhou, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, drzhou@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, qgu@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: UCRL-RFE Planning Module (PLAN); Algorithm 2: UCRL-RFE (Hoeffding Bonus); Algorithm 3: UCRL-RFE+ (Bernstein Bonus). A minimal sketch of the Hoeffding-type bonus appears after this table.
Open Source Code | No | The paper does not state that it provides open-source code for the described methodology, nor does it link to a code repository. The reproducibility checklist states: '3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]'.
Open Datasets | No | The paper is theoretical and does not conduct empirical experiments on datasets; it therefore does not mention public dataset availability or use specific datasets for training.
Dataset Splits | No | The paper is theoretical and does not conduct empirical experiments; it therefore does not describe training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical and does not report experimental hardware. The reproducibility checklist marks experiments and associated details as N/A.
Software Dependencies | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not list software dependencies or version numbers needed to reproduce experiments.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup, such as hyperparameters or system-level training settings.
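
To make the pseudocode entries above more concrete, the following is a minimal numpy sketch of the two primitives the Hoeffding-bonus variant of UCRL-RFE is built around: a rank-one ridge-regression update for the unknown mixture parameter θ* and the optimism bonus β‖φ_V(s, a)‖_{Σ^{-1}}. This is a hedged reconstruction for illustration only, not the authors' code (the paper releases none); the names phi_V, beta, and lam, the toy dimension, and the synthetic usage at the end are assumptions made for this example.

import numpy as np

# Illustrative sketch (not the authors' released code) of the Hoeffding-style
# bonus used by UCRL-RFE-type algorithms for linear mixture MDPs.
# The names phi_V, beta, and lam are assumptions made for this example.

d = 8         # feature dimension (assumed for the toy example)
lam = 1.0     # ridge regularization parameter (assumed)
beta = 1.0    # confidence radius; in the paper it depends on H, d and log terms (assumed constant here)

Sigma = lam * np.eye(d)   # regularized covariance of value-targeted features
b = np.zeros(d)           # running sum of feature * regression target

def update(phi_V, target):
    """Rank-one ridge-regression update after one observed transition.

    phi_V  : d-dim value-targeted feature, phi_V(s, a) = sum_{s'} phi(s'|s, a) * V(s')
    target : realized regression target, e.g. V(next state)
    """
    global Sigma, b
    Sigma += np.outer(phi_V, phi_V)
    b += phi_V * target

def theta_hat():
    """Ridge estimate of the unknown mixture parameter theta*."""
    return np.linalg.solve(Sigma, b)

def hoeffding_bonus(phi_V):
    """Optimism bonus beta * ||phi_V||_{Sigma^{-1}} added to the estimated value."""
    return beta * np.sqrt(phi_V @ np.linalg.solve(Sigma, phi_V))

# Toy usage: one synthetic update followed by a bonus query.
rng = np.random.default_rng(0)
phi = rng.normal(size=d)
update(phi, target=rng.normal())
print(theta_hat(), hoeffding_bonus(phi))

Roughly speaking, during the reward-free exploration phase this uncertainty term itself acts as the pseudo-reward that drives exploration, and the Bernstein-bonus variant (UCRL-RFE+) replaces the fixed radius beta with a variance-aware one, which is what tightens the sample complexity from Õ(H^5 d^2 ϵ^{-2}) to Õ(H^4 d(H + d) ϵ^{-2}).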