Empirical Likelihood for Contextual Bandits
Authors: Nikos Karampatziakis, John Langford, Paul Mineiro
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically find that both our estimator and confidence interval improve over previous proposals in finite sample regimes. Finally, the policy optimization algorithm we propose outperforms a strong baseline system for learning from off-policy data. |
| Researcher Affiliation | Industry | Nikos Karampatziakis Microsoft Dynamics 365 AI nikosk@microsoft.com John Langford Microsoft Research jcl@microsoft.com Paul Mineiro Microsoft Research pmineiro@microsoft.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Replication instructions are available in the supplement, and replication software is available at http://github.com/pmineiro/elfcb. |
| Open Datasets | Yes | We use 40 classification datasets from OpenML [31]; apply a supervised-to-bandit transform [9]; and limit the datasets to 10,000 examples. |
| Dataset Splits | Yes | Each dataset is randomly split 20%/60%/20% into Initialize/Learn/Evaluate subsets, to learn h, learn π, and evaluate π respectively. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Vowpal Wabbit' but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | We made no effort to tune the confidence level, setting it to 95% for all experiments. For optimizing the policy parameters and the distribution dual variables, we alternate between solving the dual problem with the policy fixed and then optimizing the policy with the dual variables fixed. To optimize the policy we do a single pass over the data using Vowpal Wabbit as a black-box oracle for learning, supplying different importance weights on each example depending upon the dual variables. We do 4 passes over the learning set and update the dual variables before each pass. |
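The alternating scheme described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names (`toy_oracle`, `solve_dual`, `learn_policy`) are hypothetical, Vowpal Wabbit is replaced by a toy weighted-classification oracle, and the dual step is a simple normalization stand-in for the empirical-likelihood dual solve.

```python
def toy_oracle(data, weights):
    """Toy weighted oracle standing in for Vowpal Wabbit: for each context,
    pick the action with the largest total importance weight."""
    scores = {}
    for (x, a, _, _), w in zip(data, weights):
        scores.setdefault(x, {}).setdefault(a, 0.0)
        scores[x][a] += w
    return {x: max(acts, key=acts.get) for x, acts in scores.items()}

def solve_dual(data, policy):
    """Toy dual step (hypothetical): rescale so the importance-weighted mass
    of examples the current policy matches sums to 1."""
    matched = sum(r / p for x, a, r, p in data if policy.get(x) == a)
    return 1.0 / matched if matched > 0 else 1.0

def learn_policy(data, n_passes=4):
    """Alternate: update duals with the policy fixed, then make one
    importance-weighted pass over the data with the duals fixed."""
    policy = {}  # start from an empty policy
    for _ in range(n_passes):
        lam = solve_dual(data, policy)                   # duals, policy fixed
        weights = [lam * r / p for x, a, r, p in data]   # per-example weights
        policy = toy_oracle(data, weights)               # policy, duals fixed
    return policy

# Logged bandit data: (context, action, reward, logging probability)
logged = [("x1", "a", 1.0, 0.5), ("x1", "b", 0.0, 0.5),
          ("x2", "b", 1.0, 0.5), ("x2", "a", 0.0, 0.5)]
pi = learn_policy(logged)
```

On this toy log the learned policy picks the rewarded action for each context; the point is only the control flow (dual solve, then one weighted pass, repeated 4 times), mirroring the setup the paper describes.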