Offline RL Without Off-Policy Evaluation
Authors: David Brandfonbrener, Will Whitney, Rajesh Ranganath, Joan Bruna
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main empirical finding is that one step of policy improvement is sufficient to beat state of the art results on much of the D4RL benchmark suite [Fu et al., 2020]. Results are shown in Table 1. |
| Researcher Affiliation | Collaboration | David Brandfonbrener William F. Whitney Rajesh Ranganath Joan Bruna Department of Computer Science, Center for Data Science New York University david.brandfonbrener@nyu.edu ... This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. |
| Pseudocode | Yes | Algorithm 1: OAMPI |
| Open Source Code | Yes | Full experimental details are in Appendix C and code can be found at https://github.com/davidbrandfonbrener/onestep-rl. |
| Open Datasets | Yes | D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. ... Data from Fu et al. [2020]. The license is Apache 2.0. |
| Dataset Splits | Yes | We chose the best performing model by evaluation performance on the validation data. This is what we mean by allowing access to the environment for hyperparameter tuning. ... We report the mean and standard error over 10 seeds of the training process and using 100 evaluation episodes per seed. |
| Hardware Specification | Yes | All models were trained on a single NVIDIA 2080 Ti. |
| Software Dependencies | No | The paper states: 'Code is written in PyTorch.' However, it does not specify the version number of PyTorch or any other software dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | Following Fu et al. [2020] and others in this line of work, we allow access to the environment to tune a small (< 10) set of hyperparameters. ... Each algorithm is tuned over 6 values of their respective hyperparameter. ... We report the mean and standard error over 10 seeds of the training process and using 100 evaluation episodes per seed. |
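The one-step recipe behind Algorithm 1 (OAMPI) referenced in the table can be illustrated with a minimal tabular sketch: estimate the behavior policy from counts, evaluate its Q-function with SARSA-style on-policy TD updates over the fixed dataset (no off-policy evaluation), and then take a single greedy improvement step restricted to actions the data supports. This is a hypothetical toy variant for intuition only, not the paper's actual deep-RL implementation or its D4RL setup; all names and the toy dataset below are illustrative.

```python
import numpy as np

def one_step_rl(dataset, n_states, n_actions, gamma=0.9, lr=0.5, epochs=200):
    """Toy sketch of one-step offline RL (hypothetical tabular variant).

    dataset: list of (s, a, r, s_next, a_next, done) tuples.
    Returns Q^beta estimated by SARSA-style TD, and a one-step improved
    deterministic policy restricted to in-support actions.
    """
    # 1) Behavior support via state-action visit counts (stands in for
    #    behavior cloning in the tabular case).
    counts = np.zeros((n_states, n_actions))
    for s, a, _, _, _, _ in dataset:
        counts[s, a] += 1

    # 2) Policy evaluation of the behavior policy: SARSA targets use the
    #    logged next action, so no off-policy correction is needed.
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in dataset:
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += lr * (target - Q[s, a])

    # 3) One step of policy improvement, greedy over supported actions only.
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        support = counts[s] > 0
        if support.any():
            pi[s] = int(np.argmax(np.where(support, Q[s], -np.inf)))
    return Q, pi

# Illustrative 2-state dataset: from state 0, action 0 leads to state 1
# (then reward 1), while action 1 terminates immediately with reward 0.2.
dataset = [
    (0, 0, 0.0, 1, 0, False),
    (1, 0, 1.0, 0, 0, True),
    (0, 1, 0.2, 0, 0, True),
]
Q, pi = one_step_rl(dataset, n_states=2, n_actions=2)
```

With these transitions the evaluated values converge to Q[0,0] ≈ 0.9 and Q[0,1] = 0.2, so the single improvement step picks action 0 in state 0; the in-support mask is what keeps the improved policy from exploiting unvisited actions, which is the failure mode iterative off-policy methods hit on offline data.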