Offline RL Without Off-Policy Evaluation
Authors: David Brandfonbrener, Will Whitney, Rajesh Ranganath, Joan Bruna
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main empirical finding is that one step of policy improvement is sufficient to beat state of the art results on much of the D4RL benchmark suite [Fu et al., 2020]. Results are shown in Table 1. |
| Researcher Affiliation | Collaboration | David Brandfonbrener William F. Whitney Rajesh Ranganath Joan Bruna Department of Computer Science, Center for Data Science New York University david.brandfonbrener@nyu.edu ... This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. |
| Pseudocode | Yes | Algorithm 1: OAMPI |
| Open Source Code | Yes | Full experimental details are in Appendix C and code can be found at https://github.com/davidbrandfonbrener/onestep-rl. |
| Open Datasets | Yes | D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. ... Data from Fu et al. [2020]. The license is Apache 2.0. |
| Dataset Splits | Yes | We chose the best performing model by evaluation performance on the validation data. This is what we mean by allowing access to the environment for hyperparameter tuning. ... We report the mean and standard error over 10 seeds of the training process and using 100 evaluation episodes per seed. |
| Hardware Specification | Yes | All models were trained on a single NVIDIA 2080 Ti. |
| Software Dependencies | No | The paper states: 'Code is written in PyTorch.' However, it does not specify the version number of PyTorch or any other software dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | Following Fu et al. [2020] and others in this line of work, we allow access to the environment to tune a small (< 10) set of hyperparameters. ... Each algorithm is tuned over 6 values of their respective hyperparameter. ... We report the mean and standard error over 10 seeds of the training process and using 100 evaluation episodes per seed. |
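The one-step recipe behind Algorithm 1 (OAMPI) referenced in the table can be illustrated with a minimal tabular sketch: estimate the behavior policy from counts, evaluate its Q-function with SARSA-style on-policy TD updates over the fixed dataset (no off-policy evaluation), and then take a single greedy improvement step restricted to actions the data supports. This is a hypothetical toy variant for intuition only, not the paper's actual deep-RL implementation or its D4RL setup; all names and the toy dataset below are illustrative.

```python
import numpy as np

def one_step_rl(dataset, n_states, n_actions, gamma=0.9, lr=0.5, epochs=200):
    """Toy sketch of one-step offline RL (hypothetical tabular variant).

    dataset: list of (s, a, r, s_next, a_next, done) tuples.
    Returns Q^beta estimated by SARSA-style TD, and a one-step improved
    deterministic policy restricted to in-support actions.
    """
    # 1) Behavior support via state-action visit counts (stands in for
    #    behavior cloning in the tabular case).
    counts = np.zeros((n_states, n_actions))
    for s, a, _, _, _, _ in dataset:
        counts[s, a] += 1

    # 2) Policy evaluation of the behavior policy: SARSA targets use the
    #    logged next action, so no off-policy correction is needed.
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in dataset:
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += lr * (target - Q[s, a])

    # 3) One step of policy improvement, greedy over supported actions only.
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        support = counts[s] > 0
        if support.any():
            pi[s] = int(np.argmax(np.where(support, Q[s], -np.inf)))
    return Q, pi

# Illustrative 2-state dataset: from state 0, action 0 leads to state 1
# (then reward 1), while action 1 terminates immediately with reward 0.2.
dataset = [
    (0, 0, 0.0, 1, 0, False),
    (1, 0, 1.0, 0, 0, True),
    (0, 1, 0.2, 0, 0, True),
]
Q, pi = one_step_rl(dataset, n_states=2, n_actions=2)
```

With these transitions the evaluated values converge to Q[0,0] ≈ 0.9 and Q[0,1] = 0.2, so the single improvement step picks action 0 in state 0; the in-support mask is what keeps the improved policy from exploiting unvisited actions, which is the failure mode iterative off-policy methods hit on offline data.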