Safe Exploration for Efficient Policy Evaluation and Comparison
Authors: Runzhe Wan, Branislav Kveton, Rui Song
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both theoretical analysis and experiments support the usefulness of the proposed methods. Lastly, we demonstrate the superior performance of SEPEC through extensive experiments. In this section, we compare the empirical performance of various methods. |
| Researcher Affiliation | Collaboration | Runzhe Wan (1), Branislav Kveton (2), Rui Song (1); (1) Department of Statistics, North Carolina State University; (2) Amazon. |
| Pseudocode | Yes | Algorithm 1: Safe MAB exploration with the cutting-plane method. Algorithm 2: Optimization for linear bandits with the FW algorithm. Algorithm 3: Safe linear bandits exploration with the FW algorithm and the cutting-plane method. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In addition, to study the performance on real datasets, we conduct experiments using the MNIST dataset (Deng, 2012) and present the results in Figure 4, with more details given in Appendix C.3. |
| Dataset Splits | No | The paper describes data generation and sampling methods for experiments but does not provide specific training, validation, and test dataset splits with percentages or counts for reproducibility. For instance, for CMAB, it states, "we consider 30 contexts with {p(x_i)} sampled from Dirichlet(1_{30})," and for policies on MNIST, "two classifiers trained with 1000 randomly sampled data points," but no standard dataset splits are detailed. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any key software components or libraries used in the experiments. |
| Experiment Setup | Yes | For MAB, we set K = 10, T = 500, δ = 0.05, σ_a = 3, sample r_a ~ Uniform(0, 1), and vary the risk tolerance ϵ. For CMAB, we consider 30 contexts with {p(x_i)} sampled from Dirichlet(1_{30}). For linear bandits, we sample θ from the standard multivariate normal and x from the unit sphere uniformly, with K = 100, d = 5, T = 200, and |D_0| = 100. A detailed description of experiments is in Appendix C.1. (An illustrative sketch of this setup appears below the table.) |
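
Below is a minimal Python/NumPy sketch of the experiment setup quoted above. It only reproduces the sampling of problem instances (MAB mean rewards, CMAB context distribution, linear-bandit parameters and arm features); it is not the authors' code, which is not released, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- MAB setting quoted in the paper: K = 10, T = 500, delta = 0.05,
#     sigma_a = 3, mean rewards r_a ~ Uniform(0, 1) ---
K_mab, T_mab, delta, sigma_a = 10, 500, 0.05, 3.0
r = rng.uniform(0.0, 1.0, size=K_mab)            # per-arm mean rewards

# --- CMAB setting: 30 contexts with probabilities {p(x_i)} ~ Dirichlet(1_30) ---
n_contexts = 30
p_context = rng.dirichlet(np.ones(n_contexts))   # context distribution

# --- Linear bandit setting: K = 100 arms, d = 5, T = 200, |D_0| = 100;
#     theta ~ N(0, I_d), arm features x drawn uniformly from the unit sphere ---
K_lin, d, T_lin, n_logged = 100, 5, 200, 100
theta = rng.standard_normal(d)                   # unknown reward parameter
X = rng.standard_normal((K_lin, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize rows onto the unit sphere

# Expected reward of each arm under the linear model x^T theta
expected_reward = X @ theta
print(expected_reward[:5])
```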