Model-Based Offline Reinforcement Learning with Local Misspecification

Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks. While our primary contribution is theoretical, our proposed method for policy selection improves over the state-of-the-art MML (Voloshin, Jiang, and Yue 2021) in a simple linear Gaussian setting, and has solid performance on policy selection on a set of D4RL benchmarks. We first empirically evaluate our method on the Linear-Quadratic Regulator (LQR)... We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks.
Researcher Affiliation | Academia | Kefan Dong*, Yannis Flet-Berliac*, Allen Nie*, Emma Brunskill, Stanford University, {kefandong,yfletberliac,anie,ebrun}@stanford.edu
Pseudocode | Yes | Algorithm 1: Model-based Offline RL with Local Misspecification Error (a generic selection-loop sketch appears after the table)
Open Source Code | No | The paper does not explicitly state that its source code is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. (A hedged data-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions 'dataset D' and training with '5 random seeds' but does not specify explicit percentages or counts for training, validation, and test splits of the datasets used.
Hardware Specification | No | The paper does not provide hardware details such as GPU/CPU models, memory, or other machine specifications used to run its experiments.
Software Dependencies | No | The paper mentions software components like SAC and FQE but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We learn the dynamics using 300k iterations and we train each policy using 100k gradient steps with SAC (Haarnoja et al. 2018) as the policy gradient algorithm, imitating the MOPO (Yu et al. 2020) policy gradient update. We choose M = 1 and K = 5, and train each tuple for 5 random seeds on the Hopper and Half Cheetah tasks. (An illustrative configuration sketch appears after the table.)
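
For the Pseudocode row: the paper's Algorithm 1 is not reproduced here. The following is only a generic sketch of a model-based offline policy-selection loop consistent with the table's description, in which several candidate policies are trained and the one with the largest estimated performance lower bound is kept. The functions `train_policy` and `lower_bound` are hypothetical placeholders, not the authors' code.

```python
# Generic policy-selection sketch (NOT the authors' Algorithm 1).
# `train_policy` and `lower_bound` are hypothetical callables standing in for
# MOPO-style SAC training inside a learned model and a performance lower-bound
# estimate, respectively.
def select_policy(hparam_grid, train_policy, lower_bound):
    """Train one candidate per hyperparameter setting and return the candidate
    with the largest estimated lower bound on performance."""
    best_policy, best_score = None, float("-inf")
    for hparams in hparam_grid:          # e.g. K = 5 candidate settings
        policy = train_policy(hparams)   # placeholder for model-based SAC training
        score = lower_bound(policy)      # placeholder for the penalized value estimate
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy


# Toy call with dummy stand-ins, only to show the intended call pattern.
best = select_policy(
    hparam_grid=[0.1, 0.5, 1.0, 2.0, 5.0],
    train_policy=lambda h: {"penalty": h},
    lower_bound=lambda policy: -policy["penalty"],
)
```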
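
For the Open Datasets row: the experiments use the public D4RL benchmark (Fu et al. 2020). Below is a minimal loading sketch, assuming the standard `d4rl` and `gym` packages; the task name "hopper-medium-v2" is an illustrative choice, not necessarily the exact dataset version used in the paper.

```python
# Minimal D4RL loading sketch; the exact dataset versions used in the paper
# are not specified here, so "hopper-medium-v2" is only an example.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("hopper-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...
print(dataset["observations"].shape, dataset["actions"].shape)
```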
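
For the Experiment Setup row, the reported hyperparameters can be collected into a single configuration. The dictionary below is an illustrative sketch only; the key names are hypothetical and do not come from the authors' code.

```python
# Illustrative configuration mirroring the quoted setup; key names are hypothetical.
config = {
    "tasks": ["hopper", "halfcheetah"],   # D4RL domains used for policy selection
    "dynamics_train_iters": 300_000,      # iterations for learning the dynamics model
    "policy_grad_steps": 100_000,         # SAC gradient steps per policy (MOPO-style updates)
    "num_models": 1,                      # M = 1
    "num_candidate_policies": 5,          # K = 5
    "num_seeds": 5,                       # random seeds per (model, policy) tuple
    "policy_optimizer": "SAC",            # Haarnoja et al. 2018
}
```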