Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Model-Based Offline Reinforcement Learning with Local Misspecification
Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks. While our primary contribution is theoretical, our proposed method for policy selection improves over the state-of-the-art MML (Voloshin, Jiang, and Yue 2021) in a simple linear Gaussian setting, and has solid performance on policy selection on a set of D4RL benchmarks. We first empirically evaluate our method on Linear-Quadratic Regulator (LQR)... We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Researcher Affiliation | Academia | Kefan Dong*, Yannis Flet-Berliac*, Allen Nie*, Emma Brunskill Stanford University EMAIL |
| Pseudocode | Yes | Algorithm 1: Model-based Offline RL with Local Misspecification Error |
| Open Source Code | No | The paper does not explicitly state that its source code is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We also evaluate our approach using D4RL (Fu et al. 2020), a standard offline RL benchmark for continuous control tasks. |
| Dataset Splits | No | The paper mentions 'dataset D' and training with '5 random seeds' but does not specify the explicit percentages or counts for training, validation, and test splits for the datasets used. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like SAC and FQE but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We learn the dynamics using 300k iterations and we train each policy using 100k gradient steps with SAC (Haarnoja et al. 2018) as the policy gradient algorithm, imitating the MOPO (Yu et al. 2020) policy gradient update. We choose M = 1 and K = 5, and train each tuple for 5 random seeds on the Hopper and Half Cheetah tasks. |
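For readers attempting reproduction, the setup quoted above can be summarized as a configuration object. This is a hedged sketch, not the authors' code: all class and field names here are hypothetical, and only the numeric values (300k dynamics iterations, 100k SAC gradient steps, M = 1, K = 5, 5 seeds, two tasks) come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical summary of the training setup reported in the paper."""
    dynamics_iterations: int = 300_000    # dynamics model training iterations
    policy_gradient_steps: int = 100_000  # SAC gradient steps per policy
    num_models: int = 1                   # M = 1 in the paper
    num_policies: int = 5                 # K = 5 in the paper
    num_seeds: int = 5                    # 5 random seeds per tuple
    tasks: tuple = ("Hopper", "HalfCheetah")

cfg = ExperimentSetup()
# Rough count of independent training runs implied by the setup:
# one run per (policy, seed, task) combination.
total_runs = cfg.num_policies * cfg.num_seeds * len(cfg.tasks)
```

This kind of explicit config makes it easy to spot which details the paper leaves unspecified (e.g. dataset splits and hardware, both marked "No" above).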