On the Design of Estimators for Bandit Off-Policy Evaluation

Authors: Nikos Vlassis, Aurélien Bibaut, Maria Dimakopoulou, Tony Jebara

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present our main results in the context of multi-armed bandits, and we describe a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets. We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014). In Table 1 we report the RMSE of the different estimators for each benchmark."
Researcher Affiliation | Collaboration | "¹Netflix, Los Gatos, CA, USA; ²Department of Biostatistics, University of California, Berkeley, Berkeley, USA."
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | "The code for all experiments is available by request from the authors."
Open Datasets | Yes | "We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014)."
Dataset Splits | No | The paper mentions splitting the data into training and test sets but gives no separate validation split and no exact percentages or counts for the splits.
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for the experiments are provided.
Software Dependencies | No | No software dependencies with version numbers (e.g., library names with their versions) are mentioned.
Experiment Setup | Yes | "For the evaluation on a dataset, we follow the methodology of Dudík et al. (2014). We randomly split the data into training and test sets of the same size. We run logistic regression to obtain a classifier π. The logging policy µ selects label π(x) with probability ε = 0.05, and with probability 1 − ε the logging policy µ selects one of the other labels {1, 2, . . . , K} \ π(x) uniformly at random. We use a linear loss model r̂(x, a) = w_a^T x, parameterized by K weight vectors {w_a}, a ∈ {1, . . . , K}, and use least-squares regression to fit w_a based on a partially labeled dataset from the training set. For each dataset, we repeat step 4 N = 500 times."
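The supervised-to-bandit conversion quoted above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the synthetic data, the random linear classifier standing in for the logistic-regression π, and the function names `logging_policy_probs` and `fit_linear_loss_model` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def logging_policy_probs(pi_x, K, eps=0.05):
    # Per the quoted setup: mu picks the classifier's label pi(x) with
    # probability eps, and each of the other K-1 labels with probability
    # (1 - eps) / (K - 1).
    p = np.full(K, (1.0 - eps) / (K - 1))
    p[pi_x] = eps
    return p

def fit_linear_loss_model(X, actions, losses, K):
    # Least-squares fit of r_hat(x, a) = w_a^T x: one weight vector per
    # action, fitted only on the rows where that action was logged.
    d = X.shape[1]
    W = np.zeros((K, d))
    for a in range(K):
        mask = actions == a
        if mask.sum() >= d:  # need enough logged rows for a solvable fit
            W[a], *_ = np.linalg.lstsq(X[mask], losses[mask], rcond=None)
    return W

# Toy stand-in for one UCI dataset: n contexts of dimension d, K labels.
K, n, d = 4, 2000, 5
X = rng.normal(size=(n, d))
y = rng.integers(K, size=n)                           # true labels
pi = np.argmax(X @ rng.normal(size=(d, K)), axis=1)   # stand-in classifier

# Log one action per context under mu and observe its 0/1 loss.
actions = np.array([rng.choice(K, p=logging_policy_probs(a, K)) for a in pi])
losses = (actions != y).astype(float)
W_hat = fit_linear_loss_model(X, actions, losses, K)
print(W_hat.shape)  # one fitted weight vector per action: (4, 5)
```

From the logged tuples (x, a, loss, µ-probability) and the fitted model, any of the compared off-policy estimators can then be evaluated against the full-information test loss; repeating the whole procedure N = 500 times per dataset is what produces the RMSE figures the report cites from Table 1.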