Doubly robust off-policy evaluation with shrinkage
Authors: Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods. and, from Section 6 (Experiments): We evaluate our new estimators on the tasks of off-policy evaluation and off-policy learning and compare their performance with previous estimators. |
| Researcher Affiliation | Collaboration | Yi Su (Cornell University, Ithaca, NY), Maria Dimakopoulou (Netflix, Los Gatos, CA), Akshay Krishnamurthy (Microsoft Research, New York, NY), and Miroslav Dudík (Microsoft Research, New York, NY). |
| Pseudocode | No | The paper describes algorithms and derivations mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | Following prior work (Dudík et al., 2014; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2018), we simulate bandit feedback on 9 UCI multi-class classification datasets. and Following Swaminathan et al. (2017), we generate contextual bandit data from the fully labeled MSLR-WEB10K dataset (Qin & Liu, 2013). The UCI ML Repository is also cited: Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. |
| Dataset Splits | Yes | For every dataset, we hold out 25% of the examples to measure ground truth. On the remaining 75% of the dataset, we use logging policy µ to simulate n bandit examples by sampling a context x from the dataset, sampling an action y ∼ µ(· | x) and then observing a deterministic or stochastic reward r. and In evaluation experiments, we use 1/2 of the bandit data to train η̂; in learning experiments, we use 1/3 of the bandit data to train η̂. The remaining bandit data is used to calculate the value of each estimator. A sketch of this split procedure appears after the table. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU or CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions methods like 'logistic models', 'linear models via weighted least squares with ℓ2 regularization', and 'ridge regression' but does not specify any software libraries or their version numbers used for implementation or experimentation. |
| Experiment Setup | Yes | We obtain reward predictors η̂ by training linear models via weighted least squares with ℓ2 regularization. We consider weights z(x, a) ∈ {1, w(x, a), w²(x, a)} as well as the more robust doubly robust (MRDR) weight design of Farajtabar et al. (2018) and We solve ℓ2-regularized empirical risk minimization û = argmin_u −V̂(π_u) + γ‖u‖² via gradient descent, where V̂ is a policy-value estimator and γ > 0 is a hyperparameter. and Table 1 (Policy parameters used in the experiments). A sketch of the reward-model fitting and a plain doubly robust estimate appears after the table. |
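
The Dataset Splits row describes a supervised-to-bandit conversion; the sketch below illustrates it in Python with NumPy. The 25% ground-truth holdout, the simulation of bandit examples from the logging policy µ, and the 1/2 (evaluation) vs. 1/3 (learning) fraction used to train η̂ follow the quoted text; the function names, the `logging_policy` interface, and the deterministic 0/1 reward are illustrative assumptions, not the authors' code.

```python
import numpy as np

def simulate_bandit_feedback(X, y, logging_policy, n, rng):
    """Simulate n bandit examples from a fully labeled dataset (arrays X, y).

    A context is sampled from the dataset, an action is drawn from the
    logging policy mu(. | x), and a deterministic 0/1 reward is observed
    (1 if the action equals the true label).
    """
    idx = rng.integers(0, len(X), size=n)              # sample contexts
    probs = logging_policy(X[idx])                     # shape (n, n_actions)
    actions = np.array([rng.choice(len(p), p=p) for p in probs])
    rewards = (actions == y[idx]).astype(float)
    propensities = probs[np.arange(n), actions]        # logged mu(a | x)
    return X[idx], actions, rewards, propensities

def make_splits(X, y, logging_policy, n, rng, eta_frac=0.5):
    """Hold out 25% for ground truth; simulate bandit data on the remaining 75%.

    eta_frac is the fraction of bandit data used to train the reward model
    eta_hat: 1/2 in evaluation experiments, 1/3 in learning experiments.
    """
    perm = rng.permutation(len(X))
    n_holdout = len(X) // 4
    holdout, rest = perm[:n_holdout], perm[n_holdout:]
    bandit = simulate_bandit_feedback(X[rest], y[rest], logging_policy, n, rng)
    n_eta = int(eta_frac * n)
    eta_split = tuple(a[:n_eta] for a in bandit)       # train eta_hat here
    est_split = tuple(a[n_eta:] for a in bandit)       # compute estimators here
    return (X[holdout], y[holdout]), eta_split, est_split
```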
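
The Experiment Setup row can likewise be sketched. The snippet below fits a weighted least-squares reward predictor with ℓ2 regularization, with sample weights z(x, a) ∈ {1, w(x, a), w²(x, a)} selected by a `power` argument, and computes a plain doubly robust value estimate. The use of scikit-learn's `Ridge`, the helper names, and the unshrunk DR formula (the paper's contribution is the shrinkage applied to the importance weights, which is not reproduced here) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_model(features, rewards, importance_weights, power=1, alpha=1.0):
    """Reward predictor eta_hat: weighted least squares with l2 regularization.

    The sample weight z(x, a) is the importance weight w(x, a) raised to
    `power`, so power in {0, 1, 2} corresponds to z in {1, w, w^2}.
    """
    z = importance_weights ** power
    return Ridge(alpha=alpha).fit(features, rewards, sample_weight=z)

def dr_value(target_probs, propensities, rewards, eta_logged, eta_expected):
    """Standard (unshrunk) doubly robust estimate of the target policy value.

    target_probs : pi(a_i | x_i) for the logged actions
    propensities : mu(a_i | x_i) from the logging policy
    eta_logged   : eta_hat(x_i, a_i) for the logged actions
    eta_expected : E_{a ~ pi(. | x_i)}[eta_hat(x_i, a)]
    """
    w = target_probs / propensities                    # importance weights
    return np.mean(eta_expected + w * (rewards - eta_logged))
```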