Active Offline Policy Selection

Authors: Ksenia Konyushkova, Yutian Chen, Tom Le Paine, Caglar Gulcehre, Cosmin Paduraru, Daniel J. Mankowitz, Misha Denil, Nando de Freitas

NeurIPS 2021

Reproducibility assessment: each entry below gives the variable, its result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation. Section 4 shows that active policy evaluation can improve upon the OPE after merely a few interactions, thanks to the kernel that ensures data-efficiency. Additionally, our method works reliably with OPEs of varying quality and it scales well with the growing number of candidate policies."

Researcher Affiliation: Industry
LLM Response: "Ksenia Konyushkova (DeepMind, kksenia@deepmind.com); Yutian Chen (DeepMind, yutianc@deepmind.com); Tom Le Paine (DeepMind, tpaine@deepmind.com); Caglar Gulcehre (DeepMind, caglarg@deepmind.com); Cosmin Paduraru (DeepMind, paduraru@deepmind.com); Daniel J Mankowitz (DeepMind, dmankowitz@deepmind.com); Misha Denil (DeepMind, mdenil@deepmind.com); Nando de Freitas (DeepMind, nandodefreitas@deepmind.com)"

Pseudocode: No
LLM Response: "The paper describes algorithmic steps in prose but does not include any clearly labeled pseudocode or algorithm blocks."

Open Source Code: Yes
LLM Response: "The paper website is at https://sites.google.com/corp/view/active-ops and the code is at https://github.com/deepmind/active_ops."

Open Datasets: Yes
LLM Response: "DeepMind Control Suite (dm-control): a standard set of continuous control environments [65]. Manipulation Playground (MPG): a simulated robotics environment. Atari: a popular benchmark with discrete actions in online and offline RL [23]."

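As an illustration of how the first of these benchmarks is typically accessed, the sketch below loads a dm-control task and steps it with uniform random actions. The specific domain and task names are arbitrary examples chosen here, not necessarily tasks used in the paper.

```python
# Minimal sketch: loading a DeepMind Control Suite task and stepping it with
# random actions. The 'cartpole'/'swingup' choice is only illustrative.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    time_step = env.step(action)
    # time_step.reward and time_step.observation carry the per-step signal.
```
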
Dataset Splits: No
LLM Response: "The paper describes how policies are selected for evaluation ('randomly select a subset of K policies') and how performance is measured ('simple regret as a function of the number of executed episodes'), but it does not specify traditional train/validation/test dataset splits for the model or method presented in the paper itself."

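A rough sketch of that evaluation protocol is given below. It assumes a hypothetical select_policy routine that returns the index of the recommended policy after a given budget of executed episodes, and an array true_values of ground-truth policy returns; these names are illustrative and are not taken from the paper's code.

```python
# Sketch of the evaluation protocol quoted above (hypothetical helpers).
# true_values[i] is assumed to be the ground-truth return of policy i, and
# select_policy(candidates, budget) stands in for any policy-selection method.
import numpy as np

def simple_regret(true_values, selected_idx, candidate_idx):
    """Gap between the best candidate policy and the one that was selected."""
    best = max(true_values[i] for i in candidate_idx)
    return best - true_values[selected_idx]

def evaluate_selection(true_values, select_policy, k=50, budget=100,
                       repeats=100, seed=0):
    rng = np.random.default_rng(seed)
    regrets = []
    for _ in range(repeats):
        # Randomly select a subset of K policies out of all trained policies.
        candidates = rng.choice(len(true_values), size=k, replace=False)
        chosen = select_policy(candidates, budget)  # hypothetical selector
        regrets.append(simple_regret(true_values, chosen, candidates))
    # Average result and standard deviation of the mean estimate.
    return np.mean(regrets), np.std(regrets) / np.sqrt(repeats)
```
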
Hardware Specification: No
LLM Response: "All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11], and run on GPUs on an internal cluster. We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase, and run all the policy selection experiments using CPUs." The paper mentions 'GPUs', 'internal cluster', and 'CPUs' but does not provide specific models or detailed specifications.

Software Dependencies: No
LLM Response: "All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11]... We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase..." The paper lists software components like Acme, Reverb, and TensorFlow, but does not provide specific version numbers for any of them.

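Since no versions are pinned, a reproducer would need to record them locally. The snippet below shows one way to do that with importlib.metadata, assuming the usual PyPI distribution names (dm-acme, dm-reverb, tensorflow); those names are an assumption here, not something stated in the paper.

```python
# Sketch: record installed versions of the dependencies named in the paper.
# The PyPI distribution names below are assumed, not taken from the paper.
from importlib.metadata import PackageNotFoundError, version

for dist in ("dm-acme", "dm-reverb", "tensorflow"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```
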
Experiment Setup: Yes
LLM Response: "To evaluate the policy selection procedure in each experiment we randomly select a subset of K policies out of all trained policies. We set K = 50 for dm-control and K = 200 for MPG and Atari. Then, we repeat each experiment 100 times, and report the average results and the standard deviation of the mean estimate. We use a constant mean m without loss of generality. We assume a flat prior for the hyper-parameter m, and weakly informative inverse Gamma priors for the variances σ²_ρ and σ²_r. Finally, we compute the Matérn 1/2 kernel as K(π1, π2) = σ²_k exp(−d(π1, π2)/l), where σ_k and l are the trainable variance and length-scale hyperparameters. FQE is the default OPE method used in our experiments unless stated otherwise."

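The quoted kernel is straightforward to compute once a distance d between policies is available. The sketch below builds the Gram matrix over a set of candidate policies, assuming a user-supplied policy_distance function (a placeholder; the paper's own policy distance is not reproduced here).

```python
# Sketch of the Matérn 1/2 kernel from the quoted setup:
#   K(pi_i, pi_j) = sigma_k**2 * exp(-d(pi_i, pi_j) / length_scale)
# policy_distance is a placeholder for whatever distance d is used between
# policies; sigma_k and length_scale are the trainable hyperparameters.
import numpy as np

def matern12_gram(policies, policy_distance, sigma_k=1.0, length_scale=1.0):
    n = len(policies)
    gram = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = policy_distance(policies[i], policies[j])
            k = sigma_k**2 * np.exp(-d / length_scale)
            gram[i, j] = gram[j, i] = k
    return gram
```

In a GP over policy values, a matrix like this would serve as the prior covariance across the K candidate policies, which is what makes observations of one policy informative about similar ones.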