Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On the Design of Estimators for Bandit Off-Policy Evaluation
Authors: Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, Tony Jebara
ICML 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our main results in the context of multi-armed bandits, and we describe a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets. We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014). In Table 1 we report the RMSE of the different estimators for each benchmark. |
| Researcher Affiliation | Collaboration | 1Netflix, Los Gatos CA, USA 2Department of Biostatistics, University of California Berkeley, Berkeley, USA. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The code for all experiments is available by request from the authors. |
| Open Datasets | Yes | We use the same 9 benchmark datasets from the UCI repository (Dua & Graff, 2017; Asuncion & Newman, 2007) as in Dudík et al. (2014). |
| Dataset Splits | No | The paper mentions splitting data into training and test sets but does not provide specific details on a separate validation split or the exact percentages/counts for these splits. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for experiments are provided. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned (e.g., library names with their versions). |
| Experiment Setup | Yes | For the evaluation on a dataset, we follow the methodology of Dudík et al. (2014). We randomly split data into training and test sets of the same size. We run logistic regression to obtain a classifier π. The logging policy µ selects label π(x) with probability ϵ = 0.05, and with probability 1 − ϵ the logging policy µ selects one of the other labels {1, 2, ..., K} \ π(x) uniformly at random. We use a linear loss model r̂(x, a) = w_a^⊤ x parameterized by K weight vectors {w_a}, a ∈ {1, ..., K}, and use least-squares regression to fit w_a based on a partially labeled dataset from the training set. For each dataset, we repeat step 4, N = 500 times. |
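
The logging policy and repetition scheme quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration and not the authors' code, which is available only by request. It makes several assumptions: labels are 0-indexed, the target policy is the deterministic classifier π, the reward is 1 when the logged action matches the true label, and the estimator is plain inverse propensity scoring (IPS) rather than any estimator proposed in the paper. All function names are hypothetical.

```python
import math
import random


def logging_probs(pi_x, K, eps=0.05):
    """Action probabilities of the logging policy for one context.

    Labels are 0-indexed here (the quoted text uses 1..K).
    """
    other = (1.0 - eps) / (K - 1)  # mass shared uniformly by the K - 1 other labels
    return [eps if a == pi_x else other for a in range(K)]


def sample_logged_action(pi_x, K, eps, rng):
    """Draw one action from the logging policy."""
    return rng.choices(range(K), weights=logging_probs(pi_x, K, eps), k=1)[0]


def ips_estimate(logged, eps):
    """Plain IPS estimate of the value of the deterministic policy pi (assumed target).

    `logged` holds (pi_x, action, reward) triples. Only records where the
    logged action equals pi(x) contribute, each with weight 1 / eps.
    """
    return sum(r / eps for pi_x, a, r in logged if a == pi_x) / len(logged)


def rmse_over_repetitions(contexts, K, eps, n_reps, seed=0):
    """Repeat the logging simulation n_reps times and return the RMSE of IPS.

    `contexts` holds (pi_x, true_label) pairs. The 0/1 reward is a synthetic
    stand-in for the benchmark datasets, not the paper's exact loss.
    """
    rng = random.Random(seed)
    true_value = sum(pi_x == y for pi_x, y in contexts) / len(contexts)
    sq_errors = []
    for _ in range(n_reps):
        logged = []
        for pi_x, y in contexts:
            a = sample_logged_action(pi_x, K, eps, rng)
            logged.append((pi_x, a, 1.0 if a == y else 0.0))
        sq_errors.append((ips_estimate(logged, eps) - true_value) ** 2)
    return math.sqrt(sum(sq_errors) / n_reps)


if __name__ == "__main__":
    # Toy contexts: (classifier prediction, true label); purely illustrative data.
    toy_contexts = [(0, 0), (1, 1), (2, 0), (1, 1)]
    print(rmse_over_repetitions(toy_contexts, K=3, eps=0.05, n_reps=500))
```

Because the logging policy picks π(x) only 5% of the time, the matching records carry a weight of 1/ϵ = 20, which is why plain IPS tends to have high variance in this setup.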