Doubly robust off-policy evaluation with shrinkage
Authors: Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods. and, from Section 6 (Experiments): We evaluate our new estimators on the tasks of off-policy evaluation and off-policy learning and compare their performance with previous estimators. |
| Researcher Affiliation | Collaboration | Yi Su (Cornell University, Ithaca, NY), Maria Dimakopoulou (Netflix, Los Gatos, CA), Akshay Krishnamurthy (Microsoft Research, New York, NY), and Miroslav Dudík (Microsoft Research, New York, NY). |
| Pseudocode | No | The paper describes algorithms and derivations mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | Following prior work (Dudík et al., 2014; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2018), we simulate bandit feedback on 9 UCI multi-class classification datasets. and Following Swaminathan et al. (2017), we generate contextual bandit data from the fully labeled MSLR-WEB10K dataset (Qin & Liu, 2013). The UCI ML Repository is also cited: Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. |
| Dataset Splits | Yes | For every dataset, we hold out 25% of the examples to measure ground truth. On the remaining 75% of the dataset, we use logging policy µ to simulate n bandit examples by sampling a context x from the dataset, sampling an action y ∼ µ(· | x) and then observing a deterministic or stochastic reward r. and In evaluation experiments, we use 1/2 of the bandit data to train η̂; in learning experiments, we use 1/3 of the bandit data to train η̂. The remaining bandit data is used to calculate the value of each estimator. A sketch of this split procedure appears after the table. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU or CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions methods like 'logistic models', 'linear models via weighted least squares with ℓ2 regularization', and 'ridge regression' but does not specify any software libraries or their version numbers used for implementation or experimentation. |
| Experiment Setup | Yes | We obtain reward predictors η̂ by training linear models via weighted least squares with ℓ2 regularization. We consider weights z(x, a) ∈ {1, w(x, a), w²(x, a)} as well as the more robust doubly robust (MRDR) weight design of Farajtabar et al. (2018) and We solve ℓ2-regularized empirical risk minimization û = argmin_u −V̂(π_u) + γ‖u‖² via gradient descent, where V̂ is a policy-value estimator and γ > 0 is a hyperparameter. and Table 1 (Policy parameters used in the experiments). A sketch of the reward-model fitting and a plain doubly robust estimate appears after the table. |
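
The Dataset Splits row describes a supervised-to-bandit conversion; the sketch below illustrates it in Python with NumPy. The 25% ground-truth holdout, the simulation of bandit examples from the logging policy µ, and the 1/2 (evaluation) vs. 1/3 (learning) fraction used to train η̂ follow the quoted text; the function names, the `logging_policy` interface, and the deterministic 0/1 reward are illustrative assumptions, not the authors' code.

```python
import numpy as np

def simulate_bandit_feedback(X, y, logging_policy, n, rng):
    """Simulate n bandit examples from a fully labeled dataset (arrays X, y).

    A context is sampled from the dataset, an action is drawn from the
    logging policy mu(. | x), and a deterministic 0/1 reward is observed
    (1 if the action equals the true label).
    """
    idx = rng.integers(0, len(X), size=n)              # sample contexts
    probs = logging_policy(X[idx])                     # shape (n, n_actions)
    actions = np.array([rng.choice(len(p), p=p) for p in probs])
    rewards = (actions == y[idx]).astype(float)
    propensities = probs[np.arange(n), actions]        # logged mu(a | x)
    return X[idx], actions, rewards, propensities

def make_splits(X, y, logging_policy, n, rng, eta_frac=0.5):
    """Hold out 25% for ground truth; simulate bandit data on the remaining 75%.

    eta_frac is the fraction of bandit data used to train the reward model
    eta_hat: 1/2 in evaluation experiments, 1/3 in learning experiments.
    """
    perm = rng.permutation(len(X))
    n_holdout = len(X) // 4
    holdout, rest = perm[:n_holdout], perm[n_holdout:]
    bandit = simulate_bandit_feedback(X[rest], y[rest], logging_policy, n, rng)
    n_eta = int(eta_frac * n)
    eta_split = tuple(a[:n_eta] for a in bandit)       # train eta_hat here
    est_split = tuple(a[n_eta:] for a in bandit)       # compute estimators here
    return (X[holdout], y[holdout]), eta_split, est_split
```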
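
The Experiment Setup row can likewise be sketched. The snippet below fits a weighted least-squares reward predictor with ℓ2 regularization, with sample weights z(x, a) ∈ {1, w(x, a), w²(x, a)} selected by a `power` argument, and computes a plain doubly robust value estimate. The use of scikit-learn's `Ridge`, the helper names, and the unshrunk DR formula (the paper's contribution is the shrinkage applied to the importance weights, which is not reproduced here) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_model(features, rewards, importance_weights, power=1, alpha=1.0):
    """Reward predictor eta_hat: weighted least squares with l2 regularization.

    The sample weight z(x, a) is the importance weight w(x, a) raised to
    `power`, so power in {0, 1, 2} corresponds to z in {1, w, w^2}.
    """
    z = importance_weights ** power
    return Ridge(alpha=alpha).fit(features, rewards, sample_weight=z)

def dr_value(target_probs, propensities, rewards, eta_logged, eta_expected):
    """Standard (unshrunk) doubly robust estimate of the target policy value.

    target_probs : pi(a_i | x_i) for the logged actions
    propensities : mu(a_i | x_i) from the logging policy
    eta_logged   : eta_hat(x_i, a_i) for the logged actions
    eta_expected : E_{a ~ pi(. | x_i)}[eta_hat(x_i, a)]
    """
    w = target_probs / propensities                    # importance weights
    return np.mean(eta_expected + w * (rewards - eta_logged))
```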