Benchmarks for Deep Off-Policy Evaluation

Authors: Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Thomas Paine

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an evaluation of state-of-the-art algorithms and provide open-source access to our data and code to foster future research in this area.
Researcher Affiliation | Collaboration | UC Berkeley, Google Brain, DeepMind
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Policies and evaluation code are available at https://github.com/google-research/deep_ope. See Section 5 for links to modelling code.
Open Datasets | Yes | DOPE contains two domains designed to provide a more comprehensive picture of how well OPE methods perform in different settings. These two domains are constructed using two benchmarks previously proposed for offline reinforcement learning: RL Unplugged (Gulcehre et al., 2020) and D4RL (Fu et al., 2020). A hedged loading sketch for the D4RL portion appears below the table.
Dataset Splits | No | The paper describes how ground-truth values are obtained for the policies being evaluated, but it does not specify train/validation/test splits for the datasets provided to the OPE methods themselves. It states that 'for each task we include a dataset of logged experiences D, and a set of policies {π1, π2, ..., πN}'; the OPE algorithms use D to estimate policy values, which are then compared to ground-truth values obtained by running the policies for many episodes. A sketch of this estimate-versus-ground-truth protocol appears below the table.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models) used for running its experiments.
Software Dependencies | No | The paper names the frameworks and implementations used ('All offline RL algorithms are implemented using the Acme framework (Hoffman et al., 2020)', 'We use the implementation from Kostrikov & Nachum (2020)', 'We use the implementation from Yang et al. (2020) corresponding to the algorithm Best DICE', 'Our implementation is based on the same network and hyperparameters for OPE setting as in Wen et al. (2020)') but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | We further tune the hyper-parameters, including the regularization parameter λ, the learning rates αθ and αv, and the number of iterations, on the Cartpole swingup task using the ground-truth policy value, and then fix them for all other tasks. A sketch of this tuning protocol appears below the table.
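
For the Open Datasets row: the D4RL portion of the benchmark can be loaded through the standard D4RL/Gym interface. This is a minimal sketch, assuming the `gym` and `d4rl` packages are installed; the task name `halfcheetah-medium-v0` is only an illustrative D4RL environment, not necessarily one of the exact DOPE tasks.

```python
# Sketch: loading a D4RL-style logged dataset (assumes `gym` and `d4rl` are installed;
# the task name is illustrative, not taken from the paper).
import gym
import d4rl  # registers the offline-RL environments with gym

env = gym.make('halfcheetah-medium-v0')

# Full logged dataset as a dict of numpy arrays:
# 'observations', 'actions', 'rewards', 'terminals', ...
dataset = env.get_dataset()
print(dataset['observations'].shape, dataset['actions'].shape)

# Convenience view that also includes next_observations, handy for
# Q-learning-style OPE methods such as fitted Q-evaluation.
transitions = d4rl.qlearning_dataset(env)
```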
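For the Dataset Splits row: the protocol described there (estimate each policy's value from the logged dataset D, then compare against ground-truth values obtained by running the policy for many episodes) can be sketched as below. Everything here is illustrative: `ope_estimate`, `policies`, the discount factor, and the classic Gym-style environment interface are assumptions, and the ground truth is simply a Monte Carlo average of discounted returns.

```python
import numpy as np


def monte_carlo_value(env, policy, num_episodes=1000, gamma=0.995):
    """Ground-truth estimate: average discounted return over many rollouts.

    Assumes the classic Gym interface (reset() -> obs, step() -> obs, reward,
    done, info) and a policy callable that maps an observation to an action.
    """
    returns = []
    for _ in range(num_episodes):
        obs, done, ret, discount = env.reset(), False, 0.0, 1.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += discount * reward
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))


def evaluate_ope_method(ope_estimate, dataset, policies, env):
    """Compare OPE estimates against ground truth for a set of policies.

    `ope_estimate(dataset, policy)` is a placeholder for any OPE algorithm
    (fitted Q-evaluation, model-based, DICE-style, importance sampling, ...).
    """
    estimates = np.array([ope_estimate(dataset, pi) for pi in policies])
    ground_truth = np.array([monte_carlo_value(env, pi) for pi in policies])
    mean_abs_error = np.abs(estimates - ground_truth).mean()
    return estimates, ground_truth, mean_abs_error
```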
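For the Experiment Setup row: tuning λ, the learning rates, and the iteration count on Cartpole swingup against ground-truth policy values, then freezing them for every other task, amounts to a small grid search. The sketch below is an assumption about how such a search could look; the search space, the `run_ope` signature, and the helper name are all hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

# Illustrative search space only; the paper does not list the actual grids here.
GRID = {
    'lam': [0.0, 0.01, 0.1],          # regularization parameter λ
    'lr_theta': [1e-4, 3e-4],         # learning rate αθ
    'lr_v': [1e-4, 3e-4],             # learning rate αv
    'num_iterations': [50_000, 200_000],
}


def tune_on_cartpole_swingup(run_ope, dataset, policies, ground_truth):
    """Pick the hyperparameters whose OPE estimates are closest to ground truth
    on the tuning task; the chosen setting is then fixed for all other tasks."""
    best_hparams, best_err = None, np.inf
    for values in itertools.product(*GRID.values()):
        hparams = dict(zip(GRID.keys(), values))
        estimates = np.array([run_ope(dataset, pi, **hparams) for pi in policies])
        err = np.abs(estimates - ground_truth).mean()
        if err < best_err:
            best_hparams, best_err = hparams, err
    return best_hparams
```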