Oracle Inequalities for Model Selection in Offline Reinforcement Learning
Authors: Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai, Emma Brunskill
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with several numerical simulations showing it is capable of reliably selecting a good model class. In Section 5, we demonstrate the effectiveness of MODBE on several simulated experimental domains. We use neural network-based offline RL algorithms as baselines and show that MODBE is able to reliably select a good model class. |
| Researcher Affiliation | Collaboration | Jonathan N. Lee, Stanford University (jnl@stanford.edu); George Tucker, Google Research (gjt@google.com); Ofir Nachum, Google Research (ofirnachum@google.com); Bo Dai, Google Research (bodai@google.com); Emma Brunskill, Stanford University (ebrun@cs.stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Model Selection via Bellman Error (MODBE) |
| Open Source Code | Yes | Supplementary material is available at: https://sites.google.com/stanford.edu/offline-model-selection. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | We evaluated MODBE in three simulated environments with discrete actions: (1) synthetic contextual bandits (CB), (2) Gym Cart Pole, (3) Gym Mountain Car. |
| Dataset Splits | Yes | All training and validation sets were split 80/20: let n_train = ⌈0.8n⌉ and n_valid = ⌊0.2n⌋, and split the dataset D randomly into D_train = (D_train,h) of n_train samples and D_valid = (D_valid,h) of n_valid samples for each h ∈ [H]. (A split sketch follows the table.) |
| Hardware Specification | No | The paper's self-checklist indicates that hardware details are included ("Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]"), but these specific details are not present in the provided paper text or Appendix D. |
| Software Dependencies | No | "Our setup for the RL problems in Gym (Brockman et al., 2016) builds on top of the open-source d3rlpy framework (Seno and Imai, 2021). We used DQN (Mnih et al., 2015), which is closest to FQI." The paper mentions software frameworks and libraries but does not provide specific version numbers for them. |
| Experiment Setup | Yes | All training and validation sets were split 80/20. We used DQN (Mnih et al., 2015), which is closest to FQI. We considered model classes that were two-layer neural networks with ReLU activations and d nodes in the hidden layer, and varied the parameter d. (An illustrative network sketch follows the table.) |
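The Dataset Splits row above quotes an 80/20 split with n_train = ⌈0.8n⌉ and n_valid = ⌊0.2n⌋ per timestep h ∈ [H]. The following is a minimal sketch of that split, assuming the offline dataset is stored as a list of per-timestep NumPy arrays; the function name `split_dataset` and the data layout are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the 80/20 per-timestep train/validation split described above.
# Assumption: the offline dataset is a list of H indexable NumPy arrays, one per timestep h.
import math
import numpy as np

def split_dataset(per_step_data, seed=0):
    """Randomly split each per-timestep dataset D_h into D_train,h and D_valid,h."""
    rng = np.random.default_rng(seed)
    train, valid = [], []
    for D_h in per_step_data:             # one array of transitions per h in [H]
        n = len(D_h)
        n_train = math.ceil(0.8 * n)      # n_train = ceil(0.8 n)
        perm = rng.permutation(n)
        train.append(D_h[perm[:n_train]])
        valid.append(D_h[perm[n_train:]]) # the remaining floor(0.2 n) samples
    return train, valid
```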
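The Experiment Setup row describes the candidate model classes as two-layer neural networks with ReLU activations and d hidden nodes, with d varied across classes. A minimal PyTorch-style sketch of one such Q-network is below; `make_q_network` and the example widths are hypothetical, since the paper's experiments build on d3rlpy and DQN rather than this code.

```python
# Illustrative sketch (not the authors' code) of a candidate model class:
# a two-layer ReLU Q-network whose hidden width d is the model-selection knob.
import torch.nn as nn

def make_q_network(obs_dim: int, num_actions: int, d: int) -> nn.Module:
    """Two-layer ReLU network with d hidden units; one Q-value output per discrete action."""
    return nn.Sequential(
        nn.Linear(obs_dim, d),        # hidden layer of width d (varied across model classes)
        nn.ReLU(),
        nn.Linear(d, num_actions),
    )

# Varying d yields the family of model classes that MODBE selects among,
# e.g. candidate_widths = [16, 64, 256]   # hypothetical values for illustration
```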