Active Offline Policy Selection
Authors: Ksenia Konyushkova, Yutian Chen, Tom Le Paine, Caglar Gulcehre, Cosmin Paduraru, Daniel J. Mankowitz, Misha Denil, Nando de Freitas
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation. Section 4 shows that active policy evaluation can improve upon the OPE after merely a few interactions, thanks to the kernel that ensures data-efficiency. Additionally, our method works reliably with OPEs of varying quality and it scales well with the growing number of candidate policies. |
| Researcher Affiliation | Industry | Ksenia Konyushkova, DeepMind, kksenia@deepmind.com; Yutian Chen, DeepMind, yutianc@deepmind.com; Tom Le Paine, DeepMind, tpaine@deepmind.com; Caglar Gulcehre, DeepMind, caglarg@deepmind.com; Cosmin Paduraru, DeepMind, paduraru@deepmind.com; Daniel J Mankowitz, DeepMind, dmankowitz@deepmind.com; Misha Denil, DeepMind, mdenil@deepmind.com; Nando de Freitas, DeepMind, nandodefreitas@deepmind.com |
| Pseudocode | No | The paper describes algorithmic steps in prose but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The paper website is at https://sites.google.com/corp/view/active-ops and the code is at https://github.com/deepmind/active_ops. |
| Open Datasets | Yes | DeepMind Control Suite (dm-control): a standard set of continuous control environments [65]. Manipulation Playground (MPG): a simulated robotics environment. Atari: a popular benchmark with discrete actions in online and offline RL [23]. |
| Dataset Splits | No | The paper describes how policies are selected for evaluation ('randomly select a subset of K policies') and how performance is measured ('simple regret as a function of the number of executed episodes'), but it does not specify traditional train/validation/test dataset splits for the model or method presented in the paper itself. |
| Hardware Specification | No | All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11], and run on GPUs on an internal cluster. We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase, and run all the policy selection experiments using CPUs. The paper mentions 'GPUs', 'internal cluster', and 'CPUs' but does not provide specific models or detailed specifications. |
| Software Dependencies | No | All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11]... We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase... The paper lists software components like Acme, Reverb, and TensorFlow, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | To evaluate the policy selection procedure in each experiment we randomly select a subset of K policies out of all trained policies. We set K = 50 for dm-control and K = 200 for MPG and Atari. Then, we repeat each experiment 100 times, and report the average results and the standard deviation of the mean estimate. We use a constant mean m without loss of generality. We assume a flat prior for the hyper-parameter m, and weakly informative inverse Gamma priors for the variances σ²ρ and σ²r. Finally, we compute the Matérn 1/2 kernel as K(π1, π2) = σ²k exp(−d(π1, π2)/l), where σk and l are the trainable variance and length-scale hyperparameters. FQE is the default OPE method used in our experiments unless stated otherwise. (A minimal sketch of this kernel follows the table.) |
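
The Matérn 1/2 kernel quoted in the Experiment Setup row is simple to reproduce. Below is a minimal NumPy sketch, not the authors' TensorFlow implementation: the Euclidean distance between per-policy feature vectors stands in for the paper's policy distance d(π1, π2), and the hyperparameter values are illustrative placeholders rather than fitted quantities.

```python
import numpy as np

def matern12_kernel(d, variance=1.0, length_scale=1.0):
    """Matérn 1/2 kernel: K = sigma_k^2 * exp(-d / l) for distances d >= 0."""
    return variance * np.exp(-d / length_scale)

def kernel_matrix(policy_features, variance=1.0, length_scale=1.0):
    """Pairwise kernel matrix over candidate policies.

    `policy_features` is an (n_policies, n_features) array; the Euclidean
    distance between rows stands in for the paper's policy distance d(pi_i, pi_j).
    """
    diffs = policy_features[:, None, :] - policy_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return matern12_kernel(dists, variance, length_scale)

# Example: 5 hypothetical candidate policies summarized by 3-dimensional features.
rng = np.random.default_rng(0)
K = kernel_matrix(rng.normal(size=(5, 3)), variance=0.5, length_scale=2.0)
print(K.shape)  # (5, 5); diagonal entries equal the variance 0.5
```

In the paper, σk and l are trained as GP hyperparameters rather than fixed; the sketch only illustrates the functional form of the kernel used to share information across candidate policies.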