Empirical Likelihood for Contextual Bandits
Authors: Nikos Karampatziakis, John Langford, Paul Mineiro
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically find that both our estimator and confidence interval improve over previous proposals in finite sample regimes. Finally, the policy optimization algorithm we propose outperforms a strong baseline system for learning from off-policy data. |
| Researcher Affiliation | Industry | Nikos Karampatziakis Microsoft Dynamics 365 AI nikosk@microsoft.com John Langford Microsoft Research jcl@microsoft.com Paul Mineiro Microsoft Research pmineiro@microsoft.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Replication instructions are available in the supplement, and replication software is available at http://github.com/pmineiro/elfcb. |
| Open Datasets | Yes | We use 40 classification datasets from OpenML [31]; apply a supervised-to-bandit transform [9]; and limit the datasets to 10,000 examples. |
| Dataset Splits | Yes | Each dataset is randomly split 20%/60%/20% into Initialize/Learn/Evaluate subsets, to learn h, learn π, and evaluate π respectively. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Vowpal Wabbit' but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | We made no effort to tune the confidence level, setting it to 95% for all experiments. For optimizing the policy parameters and the distribution dual variables, we alternate between solving the dual problem with the policy fixed and then optimizing the policy with the dual variables fixed. To optimize the policy we do a single pass over the data using Vowpal Wabbit as a black-box oracle for learning, supplying different importance weights on each example depending upon the dual variables. We do 4 passes over the learning set and update the dual variables before each pass. |
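The alternating scheme described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names (`toy_oracle`, `solve_dual`, `learn_policy`) are hypothetical, Vowpal Wabbit is replaced by a toy weighted-classification oracle, and the dual step is a simple normalization stand-in for the empirical-likelihood dual solve.

```python
def toy_oracle(data, weights):
    """Toy weighted oracle standing in for Vowpal Wabbit: for each context,
    pick the action with the largest total importance weight."""
    scores = {}
    for (x, a, _, _), w in zip(data, weights):
        scores.setdefault(x, {}).setdefault(a, 0.0)
        scores[x][a] += w
    return {x: max(acts, key=acts.get) for x, acts in scores.items()}

def solve_dual(data, policy):
    """Toy dual step (hypothetical): rescale so the importance-weighted mass
    of examples the current policy matches sums to 1."""
    matched = sum(r / p for x, a, r, p in data if policy.get(x) == a)
    return 1.0 / matched if matched > 0 else 1.0

def learn_policy(data, n_passes=4):
    """Alternate: update duals with the policy fixed, then make one
    importance-weighted pass over the data with the duals fixed."""
    policy = {}  # start from an empty policy
    for _ in range(n_passes):
        lam = solve_dual(data, policy)                   # duals, policy fixed
        weights = [lam * r / p for x, a, r, p in data]   # per-example weights
        policy = toy_oracle(data, weights)               # policy, duals fixed
    return policy

# Logged bandit data: (context, action, reward, logging probability)
logged = [("x1", "a", 1.0, 0.5), ("x1", "b", 0.0, 0.5),
          ("x2", "b", 1.0, 0.5), ("x2", "a", 0.0, 0.5)]
pi = learn_policy(logged)
```

On this toy log the learned policy picks the rewarded action for each context; the point is only the control flow (dual solve, then one weighted pass, repeated 4 times), mirroring the setup the paper describes.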