Model-Based Offline Reinforcement Learning with Local Misspecification
Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze our lower bound in the LQR setting and also show competitive performance relative to previous lower bounds on policy selection across a set of D4RL tasks. While our primary contribution is theoretical, our proposed method for policy selection improves over the state-of-the-art MML (Voloshin, Jiang, and Yue 2021) in a simple linear Gaussian setting, and has solid performance on policy selection on a set of D4RL benchmarks. We first empirically evaluate our method on the Linear-Quadratic Regulator (LQR)... We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Researcher Affiliation | Academia | Kefan Dong*, Yannis Flet-Berliac*, Allen Nie*, Emma Brunskill (Stanford University); {kefandong,yfletberliac,anie,ebrun}@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Model-based Offline RL with Local Misspecification Error |
| Open Source Code | No | The paper does not explicitly state that its source code is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Dataset Splits | No | The paper mentions 'dataset D' and training with '5 random seeds' but does not specify the explicit percentages or counts for training, validation, and test splits for the datasets used. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like SAC and FQE but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We learn the dynamics using 300k iterations and we train each policy using 100k gradient steps with SAC (Haarnoja et al. 2018) as the policy gradient algorithm, imitating the MOPO (Yu et al. 2020) policy gradient update. We choose M = 1 and K = 5, and train each tuple with 5 random seeds on the Hopper and Half Cheetah tasks. |
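
For reference, the quoted setup can be made concrete with a short sketch. The following is a minimal illustration assuming the standard D4RL Python API (`gym` plus `d4rl`); because the paper does not release code, the functions `train_dynamics_model` and `train_policy_sac`, as well as the specific dataset versions, are hypothetical placeholders rather than the authors' implementation.

```python
# Minimal sketch of the experimental setup quoted above, assuming the
# standard D4RL Python API. The dynamics/policy training routines are
# hypothetical placeholders standing in for the MOPO-style model learning
# and SAC policy optimization described in the paper.
import gym
import d4rl  # registers the D4RL offline-RL environments with gym

DYNAMICS_ITERS = 300_000     # "learn the dynamics using 300k iterations"
POLICY_GRAD_STEPS = 100_000  # "100k gradient steps with SAC"
M, K = 1, 5                  # hyperparameters reported in the paper
SEEDS = range(5)             # "train each tuple with 5 random seeds"
# Dataset versions are an assumption; the paper only names Hopper and Half Cheetah.
TASKS = ["hopper-medium-v2", "halfcheetah-medium-v2"]


def train_dynamics_model(dataset, iters):
    """Placeholder for MOPO-style dynamics-model training (not released)."""
    raise NotImplementedError


def train_policy_sac(model, dataset, steps, seed):
    """Placeholder for SAC policy optimization in the learned model (not released)."""
    raise NotImplementedError


for task in TASKS:
    env = gym.make(task)
    # qlearning_dataset returns a dict of numpy arrays:
    # observations, actions, rewards, next_observations, terminals
    dataset = d4rl.qlearning_dataset(env)
    for seed in SEEDS:
        model = train_dynamics_model(dataset, DYNAMICS_ITERS)
        policy = train_policy_sac(model, dataset, POLICY_GRAD_STEPS, seed)
```

This sketch only shows how the publicly documented D4RL datasets referenced in the table can be loaded and how the reported iteration counts and seeds would structure a run; it is not a reproduction of the paper's method.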