Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits

Authors: Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-François Ton

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.
Researcher Affiliation | Collaboration | Muhammad Faaiz Taufiq (Department of Statistics, University of Oxford); Arnaud Doucet (Department of Statistics, University of Oxford); Rob Cornish (Department of Statistics, University of Oxford); Jean-François Ton (ByteDance Research)
Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our experiments has been made available at: github.com/faaizT/MR-OPE.
Open Datasets | Yes | We consider five UCI classification datasets [37] as well as the MNIST [38] and CIFAR-100 [39] datasets.
Dataset Splits | No | The paper refers to 'training' and 'evaluation' datasets (the latter serving as test sets) with specific sizes m and n, but it does not define a separate validation split for purposes such as hyperparameter tuning, nor does it report three-way split percentages.
Hardware Specification | Yes | We ran our experiments on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz with 8GB RAM per core.
Software Dependencies | No | The paper mentions software components such as random forests and multi-layer perceptrons (MLPs) but does not provide version numbers for these or for any other libraries or frameworks used.
Experiment Setup | Yes | For our synthetic data experiment, we reproduce the experimental setup of [14] by reusing their code with minor modifications. Specifically, the context space X ⊆ ℝ^d for various values of d as described below. Likewise, the action space A = {0, …, n_a − 1}, with n_a taking a range of different values. Additional details regarding the reward function, behaviour policy π_b, and the estimation of weights ŵ(y) are included in Appendix F.2 for completeness. ... For MR, we split the training data to estimate π̂_b and ŵ(y), whereas for all other baselines we use the entire training data to estimate π̂_b, for a fair comparison. ... We used a fully connected neural network with three hidden layers of 512, 256, and 32 nodes respectively (with ReLU activations) to estimate the weights ŵ(y).
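For concreteness, below is a minimal PyTorch sketch of such a weight network. Only the architecture (three hidden layers of 512, 256 and 32 units with ReLU) comes from the excerpt above; the module and function names, the training loop, and the assumption that ŵ(y) is obtained by regressing per-sample policy ratios π_e(a|x)/π̂_b(a|x) onto the rewards y are illustrative only and are not taken from the paper.

```python
# Hypothetical sketch -- not the authors' released code.
import torch
import torch.nn as nn


class WeightNet(nn.Module):
    """MLP with hidden layers of 512, 256 and 32 units (ReLU), mapping a
    scalar reward y to an estimate of the weight w(y)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, y):
        return self.net(y)


def fit_weight_net(y, ratio_targets, epochs=200, lr=1e-3):
    """Fit w_hat(y) by least-squares regression of per-sample policy ratios
    (assumed here to be pi_e(a|x) / pi_b_hat(a|x)) onto the rewards y.

    y, ratio_targets: float tensors of shape (n, 1).
    """
    model = WeightNet()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(y), ratio_targets)
        loss.backward()
        opt.step()
    return model
```

A call such as fit_weight_net(y_tensor, ratio_tensor) with (n, 1)-shaped tensors returns the fitted network, which can then be evaluated on the rewards in the evaluation data to produce the weights used by the MR estimator.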