On the Design of Estimators for Bandit Off-Policy Evaluation
Authors: Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, Tony Jebara
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our main results in the context of multi-armed bandits, and we describe a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets. We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014). In Table 1 we report the RMSE of the different estimators for each benchmark. |
| Researcher Affiliation | Collaboration | 1Netflix, Los Gatos CA, USA 2Department of Biostatistics, University of California Berkeley, Berkeley, USA. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The code for all experiments is available by request from the authors. |
| Open Datasets | Yes | We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014). |
| Dataset Splits | No | The paper mentions splitting data into training and test sets but does not provide specific details on a separate validation split or the exact percentages/counts for these splits. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for experiments are provided. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned (e.g., library names with their versions). |
| Experiment Setup | Yes | For the evaluation on a dataset, we follow the methodology of Dudík et al. (2014). We randomly split data into training and test sets of the same size. We run logistic regression to obtain a classifier π. The logging policy µ selects label π(x) with probability ε = 0.05, and with probability 1 − ε it selects one of the other labels {1, 2, . . . , K} \ π(x) uniformly at random. We use a linear loss model r̂(x, a) = wₐᵀx parameterized by K weight vectors {wₐ}, a ∈ {1, . . . , K}, and use least-squares regression to fit wₐ based on a partially labeled dataset from the training set. For each dataset, we repeat step 4 N = 500 times. |
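The experiment setup quoted above can be sketched as follows. This is a minimal illustration, not the authors' code (which is available only by request): it uses synthetic stand-in data rather than a UCI dataset, and all names and sizes (`n`, `d`, `K`, `EPSILON`) are illustrative assumptions. The three ingredients match the excerpt: a logistic-regression classifier π, a logging policy µ that picks π(x) with probability ε = 0.05 and another label uniformly otherwise, and a per-action linear reward model fit by least squares on the logged (partially labeled) data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EPSILON = 0.05

# Synthetic stand-in for a K-class UCI dataset: n samples, d features.
n, d, K = 600, 5, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)

# Random 50/50 train/test split, as in the quoted methodology.
perm = rng.permutation(n)
tr, te = perm[: n // 2], perm[n // 2 :]

# Logistic regression gives the deterministic classifier pi.
clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])

def logging_policy_probs(x):
    """mu picks pi(x) w.p. EPSILON; otherwise one of the other
    K-1 labels uniformly at random (as in the excerpt above)."""
    p = np.full(K, (1.0 - EPSILON) / (K - 1))
    p[clf.predict(x[None, :])[0]] = EPSILON
    return p

# Log actions and 0/1 rewards (correct label -> reward 1) under mu.
probs = np.array([logging_policy_probs(x) for x in X[tr]])
actions = np.array([rng.choice(K, p=p) for p in probs])
rewards = (actions == y[tr]).astype(float)

# Per-action linear reward model r_hat(x, a) = w_a^T x, fit by
# least squares on the partially labeled logged data.
W = np.zeros((K, d))
for a in range(K):
    mask = actions == a
    if mask.sum() >= d:  # skip actions with too few logged samples
        W[a], *_ = np.linalg.lstsq(X[tr][mask], rewards[mask], rcond=None)

r_hat = X[te] @ W.T  # predicted rewards on the test set, shape (n/2, K)
```

In the paper this whole procedure is repeated N = 500 times per dataset and the RMSE of each estimator is reported; the sketch above covers a single repetition.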