Active Offline Policy Selection
Authors: Ksenia Konyushkova, Yutian Chen, Tom Le Paine, Caglar Gulcehre, Cosmin Paduraru, Daniel J. Mankowitz, Misha Denil, Nando de Freitas
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation. Section 4 shows that active policy evaluation can improve upon the OPE after merely a few interactions, thanks to the kernel that ensures data-efficiency. Additionally, our method works reliably with OPEs of varying quality and it scales well with the growing number of candidate policies. |
| Researcher Affiliation | Industry | Ksenia Konyushkova, DeepMind, kksenia@deepmind.com; Yutian Chen, DeepMind, yutianc@deepmind.com; Tom Le Paine, DeepMind, tpaine@deepmind.com; Caglar Gulcehre, DeepMind, caglarg@deepmind.com; Cosmin Paduraru, DeepMind, paduraru@deepmind.com; Daniel J Mankowitz, DeepMind, dmankowitz@deepmind.com; Misha Denil, DeepMind, mdenil@deepmind.com; Nando de Freitas, DeepMind, nandodefreitas@deepmind.com |
| Pseudocode | No | The paper describes algorithmic steps in prose but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The paper website is at https://sites.google.com/corp/view/active-ops and the code is at https://github.com/deepmind/active_ops. |
| Open Datasets | Yes | DeepMind Control Suite (dm-control): a standard set of continuous control environments [65]. Manipulation Playground (MPG): a simulated robotics environment. Atari: a popular benchmark with discrete actions in online and offline RL [23]. |
| Dataset Splits | No | The paper describes how policies are selected for evaluation ('randomly select a subset of K policies') and how performance is measured ('simple regret as a function of the number of executed episodes'), but it does not specify traditional train/validation/test dataset splits for the model or method presented in the paper itself. |
| Hardware Specification | No | All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11], and run on GPUs on an internal cluster. We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase, and run all the policy selection experiments using CPUs. The paper mentions 'GPUs', 'internal cluster', and 'CPUs' but does not provide specific models or detailed specifications. |
| Software Dependencies | No | All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work are implemented with Acme [25] and Reverb [11]... We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase... The paper lists software components like Acme, Reverb, and TensorFlow, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | To evaluate the policy selection procedure in each experiment we randomly select a subset of K policies out of all trained policies. We set K = 50 for dm-control and K = 200 for MPG and Atari. Then, we repeat each experiment 100 times, and report the average results and the standard deviation of the mean estimate. We use a constant mean m without loss of generality. We assume a flat prior for the hyper-parameter m, and weakly informative inverse Gamma priors for the variances σ²ρ and σ²r. Finally, we compute the Matérn 1/2 kernel as K(π1, π2) = σ²k exp(−d(π1, π2)/l), where σk and l are the trainable variance and length-scale hyperparameters. FQE is the default OPE method used in our experiments unless stated otherwise. (A minimal sketch of this kernel follows the table.) |
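
The Matérn 1/2 kernel quoted in the Experiment Setup row is simple to reproduce. Below is a minimal NumPy sketch, not the authors' TensorFlow implementation: the Euclidean distance between per-policy feature vectors stands in for the paper's policy distance d(π1, π2), and the hyperparameter values are illustrative placeholders rather than fitted quantities.

```python
import numpy as np

def matern12_kernel(d, variance=1.0, length_scale=1.0):
    """Matérn 1/2 kernel: K = sigma_k^2 * exp(-d / l) for distances d >= 0."""
    return variance * np.exp(-d / length_scale)

def kernel_matrix(policy_features, variance=1.0, length_scale=1.0):
    """Pairwise kernel matrix over candidate policies.

    `policy_features` is an (n_policies, n_features) array; the Euclidean
    distance between rows stands in for the paper's policy distance d(pi_i, pi_j).
    """
    diffs = policy_features[:, None, :] - policy_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return matern12_kernel(dists, variance, length_scale)

# Example: 5 hypothetical candidate policies summarized by 3-dimensional features.
rng = np.random.default_rng(0)
K = kernel_matrix(rng.normal(size=(5, 3)), variance=0.5, length_scale=2.0)
print(K.shape)  # (5, 5); diagonal entries equal the variance 0.5
```

In the paper, σk and l are trained as GP hyperparameters rather than fixed; the sketch only illustrates the functional form of the kernel used to share information across candidate policies.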