Benchmarks for Deep Off-Policy Evaluation
Authors: Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Thomas Paine
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an evaluation of state-of-the-art algorithms and provide open-source access to our data and code to foster future research in this area. |
| Researcher Affiliation | Collaboration | UC Berkeley, Google Brain, DeepMind |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Policies and evaluation code are available at https://github.com/google-research/deep_ope. See Section 5 for links to modelling code. |
| Open Datasets | Yes | DOPE contains two domains designed to provide a more comprehensive picture of how well OPE methods perform in different settings. These two domains are constructed using two benchmarks previously proposed for offline reinforcement learning: RL Unplugged (Gulcehre et al., 2020) and D4RL (Fu et al., 2020). (A minimal data-loading sketch appears after the table.) |
| Dataset Splits | No | The paper describes how ground-truth values are obtained for the policies being evaluated, but it does not specify train/validation/test splits for the datasets provided to OPE methods themselves. It states that 'for each task we include a dataset of logged experiences D, and a set of policies {π1, π2, ..., πN}'. The OPE algorithms use D to estimate policy values, which are then compared to ground-truth values obtained by running each policy for many episodes. (A sketch of this comparison protocol appears after the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models) used for running its experiments. |
| Software Dependencies | No | All offline RL algorithms are implemented using the Acme framework (Hoffman et al., 2020). We use the implementation from Kostrikov & Nachum (2020). We use the implementation from Yang et al. (2020) corresponding to the algorithm Best DICE. Our implementation is based on the same network and hyperparameters for the OPE setting as in Wen et al. (2020). The paper mentions the frameworks and implementations used but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We further tune the hyper-parameters, including the regularization parameter λ, the learning rates α_θ and α_v, and the number of iterations, on the Cartpole swingup task using the ground-truth policy value, and then fix them for all other tasks. |
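
To make the "Open Datasets" row concrete, the following is a minimal sketch of loading one of the D4RL logged-experience datasets that DOPE builds on, using the public `d4rl` package. This is not code from the paper or the `deep_ope` repository, and the task name is only illustrative.

```python
# A minimal sketch (NOT from the paper's repository): loading a D4RL
# logged-experience dataset with the public `d4rl` package.
import gym
import d4rl  # importing d4rl registers its offline-RL environments with gym

# The task name below is illustrative; DOPE covers several D4RL and
# RL Unplugged tasks.
env = gym.make("halfcheetah-medium-v0")

# Returns a dict of numpy arrays: observations, actions, rewards, terminals, ...
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["actions"].shape)
```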
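The "Dataset Splits" row describes the protocol of comparing OPE estimates computed from the logged dataset D against ground-truth values obtained by rolling out each policy for many episodes. The sketch below assumes the kinds of metrics the paper reports (absolute error, rank correlation, regret@k) and uses placeholder numbers; it is not the benchmark's evaluation code.

```python
# A minimal sketch of scoring OPE estimates against ground-truth policy values.
# The metric set is assumed from the paper; the numbers below are placeholders.
import numpy as np
from scipy.stats import spearmanr

def compare_to_ground_truth(estimated_values, true_values, k=1):
    """Score a set of OPE estimates against ground-truth policy values."""
    estimated_values = np.asarray(estimated_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)

    abs_error = np.mean(np.abs(estimated_values - true_values))
    rank_corr = spearmanr(estimated_values, true_values).correlation
    # Regret@k: gap between the best policy overall and the best policy
    # among the top-k policies as ranked by the OPE estimates.
    top_k = np.argsort(estimated_values)[-k:]
    regret_k = np.max(true_values) - np.max(true_values[top_k])
    return {"abs_error": abs_error, "rank_corr": rank_corr, "regret@k": regret_k}

# Placeholder values for N = 4 candidate policies.
print(compare_to_ground_truth([10.0, 7.5, 9.0, 3.0], [11.0, 6.0, 9.5, 2.5]))
```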