Model-Based Offline Reinforcement Learning with Local Misspecification
Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze our lower bound in the LQR setting and also show competitive performance relative to previous lower bounds on policy selection across a set of D4RL tasks. While our primary contribution is theoretical, our proposed method for policy selection improves over the state-of-the-art MML (Voloshin, Jiang, and Yue 2021) in a simple linear Gaussian setting, and has solid performance on policy selection on a set of D4RL benchmarks. We first empirically evaluate our method on the Linear-Quadratic Regulator (LQR)... We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Researcher Affiliation | Academia | Kefan Dong*, Yannis Flet-Berliac*, Allen Nie*, Emma Brunskill (Stanford University); {kefandong,yfletberliac,anie,ebrun}@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Model-based Offline RL with Local Misspecification Error |
| Open Source Code | No | The paper does not explicitly state that its source code is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Dataset Splits | No | The paper mentions 'dataset D' and training with '5 random seeds' but does not specify the explicit percentages or counts for training, validation, and test splits for the datasets used. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like SAC and FQE but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We learn the dynamics using 300k iterations and we train each policy using 100k gradient steps with SAC (Haarnoja et al. 2018) as the policy gradient algorithm, imitating the MOPO (Yu et al. 2020) policy gradient update. We choose M = 1 and K = 5, and train each tuple with 5 random seeds on the Hopper and Half Cheetah tasks. |
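
For reference, the quoted setup can be made concrete with a short sketch. The following is a minimal illustration assuming the standard D4RL Python API (`gym` plus `d4rl`); because the paper does not release code, the functions `train_dynamics_model` and `train_policy_sac`, as well as the specific dataset versions, are hypothetical placeholders rather than the authors' implementation.

```python
# Minimal sketch of the experimental setup quoted above, assuming the
# standard D4RL Python API. The dynamics/policy training routines are
# hypothetical placeholders standing in for the MOPO-style model learning
# and SAC policy optimization described in the paper.
import gym
import d4rl  # registers the D4RL offline-RL environments with gym

DYNAMICS_ITERS = 300_000     # "learn the dynamics using 300k iterations"
POLICY_GRAD_STEPS = 100_000  # "100k gradient steps with SAC"
M, K = 1, 5                  # hyperparameters reported in the paper
SEEDS = range(5)             # "train each tuple with 5 random seeds"
# Dataset versions are an assumption; the paper only names Hopper and Half Cheetah.
TASKS = ["hopper-medium-v2", "halfcheetah-medium-v2"]


def train_dynamics_model(dataset, iters):
    """Placeholder for MOPO-style dynamics-model training (not released)."""
    raise NotImplementedError


def train_policy_sac(model, dataset, steps, seed):
    """Placeholder for SAC policy optimization in the learned model (not released)."""
    raise NotImplementedError


for task in TASKS:
    env = gym.make(task)
    # qlearning_dataset returns a dict of numpy arrays:
    # observations, actions, rewards, next_observations, terminals
    dataset = d4rl.qlearning_dataset(env)
    for seed in SEEDS:
        model = train_dynamics_model(dataset, DYNAMICS_ITERS)
        policy = train_policy_sac(model, dataset, POLICY_GRAD_STEPS, seed)
```

This sketch only shows how the publicly documented D4RL datasets referenced in the table can be loaded and how the reported iteration counts and seeds would structure a run; it is not a reproduction of the paper's method.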