Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
Authors: Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-François Ton
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits. |
| Researcher Affiliation | Collaboration | Muhammad Faaiz Taufiq, Department of Statistics, University of Oxford; Arnaud Doucet, Department of Statistics, University of Oxford; Rob Cornish, Department of Statistics, University of Oxford; Jean-François Ton, ByteDance Research |
| Pseudocode | No | The paper describes methods using mathematical formulations and textual descriptions but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to reproduce our experiments has been made available at: github.com/faaizT/MR-OPE. |
| Open Datasets | Yes | We consider five UCI classification datasets [37] as well as MNIST [38] and CIFAR-100 [39] datasets. |
| Dataset Splits | No | The paper consistently refers to 'training datasets' and 'evaluation datasets' (which serve as test sets) with specific sizes (m and n). However, it does not explicitly define a separate 'validation' split for purposes such as hyperparameter tuning, nor does it report standard three-way split percentages. |
| Hardware Specification | Yes | We ran our experiments on Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz with 8GB RAM per core. |
| Software Dependencies | No | The paper mentions software components like 'random forest' and 'multi-layer perceptrons (MLP)' but does not provide specific version numbers for these or any other libraries or frameworks used. |
| Experiment Setup | Yes | For our synthetic data experiment, we reproduce the experimental setup for the synthetic data experiment in [14] by reusing their code with minor modifications. Specifically, $\mathcal{X} \subseteq \mathbb{R}^d$, for various values of $d$ as described below. Likewise, the action space $\mathcal{A} = \{0, \ldots, n_a - 1\}$, with $n_a$ taking a range of different values. Additional details regarding the reward function, behaviour policy $\pi_b$, and the estimation of weights $\hat{w}(y)$ have been included in Appendix F.2 for completeness. ... for MR, we split the training data to estimate $\hat{\pi}_b$ and $\hat{w}(y)$, whereas for all other baselines we use the entire training data to estimate $\hat{\pi}_b$ for a fair comparison. ... we used a fully connected neural network with three hidden layers with 512, 256 and 32 nodes respectively (and ReLU activation function) to estimate the weights $\hat{w}(y)$. |
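
For concreteness, below is a minimal sketch (not the authors' released code; see github.com/faaizT/MR-OPE for that) of the weight-estimation step the setup row describes. The MR estimator uses weights that depend only on the reward, $\hat{w}(y) \approx \mathbb{E}[\pi_e(a \mid x)/\pi_b(a \mid x) \mid y]$, which can be estimated by regressing the vanilla importance ratio onto $y$. The 512/256/32 ReLU architecture is taken from the quoted setup; the synthetic context distribution, uniform behaviour policy, toy reward model, softmax target policy, and the use of scikit-learn's `MLPRegressor` as a stand-in for the authors' network are all illustrative assumptions.

```python
# Sketch of MR off-policy evaluation on synthetic bandit data (assumptions noted inline).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, d, n_a = 5000, 5, 4  # sample size, context dimension, number of actions (illustrative)

# Logged behaviour data: contexts x, actions a ~ pi_b(.|x), rewards y.
x = rng.normal(size=(n, d))
pi_b = np.full((n, n_a), 1.0 / n_a)                     # uniform behaviour policy (assumption)
a = rng.integers(0, n_a, size=n)
y = x[:, 0] * (a + 1) + rng.normal(scale=0.1, size=n)   # toy reward model (assumption)

# Target policy pi_e: a softmax policy, chosen here purely for illustration.
logits = x @ rng.normal(size=(d, n_a))
pi_e = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Vanilla importance ratios rho(x, a) = pi_e(a|x) / pi_b(a|x).
idx = np.arange(n)
rho = pi_e[idx, a] / pi_b[idx, a]

# MR weights: w(y) = E[rho | y], estimated by regressing rho on the scalar reward y
# with the MLP architecture stated in the setup row (512/256/32 hidden units, ReLU).
mlp = MLPRegressor(hidden_layer_sizes=(512, 256, 32), activation="relu",
                   max_iter=500, random_state=0)
mlp.fit(y.reshape(-1, 1), rho)
w_hat = mlp.predict(y.reshape(-1, 1))

theta_mr = np.mean(w_hat * y)   # MR estimate of the target-policy value
theta_ipw = np.mean(rho * y)    # standard IPW estimate, for comparison
print(f"MR estimate:  {theta_mr:.4f}")
print(f"IPW estimate: {theta_ipw:.4f}")
```

Note that, per the setup row, the paper additionally splits the training data so that $\hat{\pi}_b$ and $\hat{w}(y)$ are estimated on disjoint subsets; the sketch above omits that split (and uses the true $\pi_b$) for brevity.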