Benchmarks for Deep Off-Policy Evaluation

Authors: Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Thomas Paine

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an evaluation of state-of-the-art algorithms and provide open-source access to our data and code to foster future research in this area.
Researcher Affiliation | Collaboration | UC Berkeley, Google Brain, DeepMind
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Policies and evaluation code are available at https://github.com/google-research/deep_ope. See Section 5 for links to modelling code.
Open Datasets | Yes | DOPE contains two domains designed to provide a more comprehensive picture of how well OPE methods perform in different settings. These two domains are constructed using two benchmarks previously proposed for offline reinforcement learning: RL Unplugged (Gulcehre et al., 2020) and D4RL (Fu et al., 2020). A hedged loading sketch for the D4RL portion appears below the table.
Dataset Splits | No | The paper describes how ground-truth values are obtained for the policies being evaluated, but it does not specify train/validation/test splits for the datasets provided to the OPE methods themselves. It states that 'for each task we include a dataset of logged experiences D, and a set of policies {π1, π2, ..., πN}'; the OPE algorithms use D to estimate policy values, which are then compared to ground-truth values obtained by running the policies for many episodes. A sketch of this estimate-versus-ground-truth protocol appears below the table.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models) used for running its experiments.
Software Dependencies | No | The paper names the frameworks and implementations used ('All offline RL algorithms are implemented using the Acme framework (Hoffman et al., 2020)', 'We use the implementation from Kostrikov & Nachum (2020)', 'We use the implementation from Yang et al. (2020) corresponding to the algorithm Best DICE', 'Our implementation is based on the same network and hyperparameters for OPE setting as in Wen et al. (2020)') but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | We further tune the hyper-parameters, including the regularization parameter λ, the learning rates αθ and αv, and the number of iterations, on the Cartpole swingup task using the ground-truth policy value, and then fix them for all other tasks. A sketch of this tuning protocol appears below the table.
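
For the Open Datasets row: the D4RL portion of the benchmark can be loaded through the standard D4RL/Gym interface. This is a minimal sketch, assuming the `gym` and `d4rl` packages are installed; the task name `halfcheetah-medium-v0` is only an illustrative D4RL environment, not necessarily one of the exact DOPE tasks.

```python
# Sketch: loading a D4RL-style logged dataset (assumes `gym` and `d4rl` are installed;
# the task name is illustrative, not taken from the paper).
import gym
import d4rl  # registers the offline-RL environments with gym

env = gym.make('halfcheetah-medium-v0')

# Full logged dataset as a dict of numpy arrays:
# 'observations', 'actions', 'rewards', 'terminals', ...
dataset = env.get_dataset()
print(dataset['observations'].shape, dataset['actions'].shape)

# Convenience view that also includes next_observations, handy for
# Q-learning-style OPE methods such as fitted Q-evaluation.
transitions = d4rl.qlearning_dataset(env)
```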
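For the Dataset Splits row: the protocol described there (estimate each policy's value from the logged dataset D, then compare against ground-truth values obtained by running the policy for many episodes) can be sketched as below. Everything here is illustrative: `ope_estimate`, `policies`, the discount factor, and the classic Gym-style environment interface are assumptions, and the ground truth is simply a Monte Carlo average of discounted returns.

```python
import numpy as np


def monte_carlo_value(env, policy, num_episodes=1000, gamma=0.995):
    """Ground-truth estimate: average discounted return over many rollouts.

    Assumes the classic Gym interface (reset() -> obs, step() -> obs, reward,
    done, info) and a policy callable that maps an observation to an action.
    """
    returns = []
    for _ in range(num_episodes):
        obs, done, ret, discount = env.reset(), False, 0.0, 1.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += discount * reward
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))


def evaluate_ope_method(ope_estimate, dataset, policies, env):
    """Compare OPE estimates against ground truth for a set of policies.

    `ope_estimate(dataset, policy)` is a placeholder for any OPE algorithm
    (fitted Q-evaluation, model-based, DICE-style, importance sampling, ...).
    """
    estimates = np.array([ope_estimate(dataset, pi) for pi in policies])
    ground_truth = np.array([monte_carlo_value(env, pi) for pi in policies])
    mean_abs_error = np.abs(estimates - ground_truth).mean()
    return estimates, ground_truth, mean_abs_error
```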
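For the Experiment Setup row: tuning λ, the learning rates, and the iteration count on Cartpole swingup against ground-truth policy values, then freezing them for every other task, amounts to a small grid search. The sketch below is an assumption about how such a search could look; the search space, the `run_ope` signature, and the helper name are all hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

# Illustrative search space only; the paper does not list the actual grids here.
GRID = {
    'lam': [0.0, 0.01, 0.1],          # regularization parameter λ
    'lr_theta': [1e-4, 3e-4],         # learning rate αθ
    'lr_v': [1e-4, 3e-4],             # learning rate αv
    'num_iterations': [50_000, 200_000],
}


def tune_on_cartpole_swingup(run_ope, dataset, policies, ground_truth):
    """Pick the hyperparameters whose OPE estimates are closest to ground truth
    on the tuning task; the chosen setting is then fixed for all other tasks."""
    best_hparams, best_err = None, np.inf
    for values in itertools.product(*GRID.values()):
        hparams = dict(zip(GRID.keys(), values))
        estimates = np.array([run_ope(dataset, pi, **hparams) for pi in policies])
        err = np.abs(estimates - ground_truth).mean()
        if err < best_err:
            best_hparams, best_err = hparams, err
    return best_hparams
```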