Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning
Authors: Nathan Kallus, Xiaojie Mao, Kaiwen Wang, Zhengyuan Zhou
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our algorithms in simulations |
| Researcher Affiliation | Collaboration | ¹Cornell University and Cornell Tech, ²Tsinghua University, ³Arena Technologies and New York University. |
| Pseudocode | Yes | Algorithm 1 Localized Doubly Robust DROPE; Algorithm 2 Continuum Doubly Robust DROPL |
| Open Source Code | Yes | Code is available at https://github.com/CausalML/doubly-robust-dropel. |
| Open Datasets | No | The paper describes a simulated data generating process and does not use or provide access information for a publicly available or open dataset. |
| Dataset Splits | Yes | Randomly split D into K (approximately) even folds, with the indices of the k-th fold denoted I_k. All models were fitted with K = 5-fold cross-fitting (see the fold-splitting sketch after this table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper mentions using the "LightGBM package (Ke et al., 2017)" and "Adam with a learning rate of 0.01" but does not provide specific version numbers for these software components or other libraries. |
| Experiment Setup | Yes | The state space is two-dimensional, S = [-1, 1]^2, and states are sampled uniformly, S ~ Unif([-1, 1]^2). The action space is A = {0, 1, ..., 4}, and the behavior policy is a softmax policy π0(a ∣ s) ∝ exp(2 s·βa), where the βa's are the coordinates of the a-th fifth root of unity, i.e. βa = (Re ζa, Im ζa) with ζa = exp(2aπi/5). Potential outcomes are normally distributed: R(a) ∣ S = s ~ N(s·βa, σa^2), where σ = [0.1, 0.2, 0.3, 0.4, 0.5]. We conducted experiments under three uncertainty set radii δ = 0.1, 0.2, 0.3, and in two settings, where the propensities π0 were known and unknown. All models were fitted with K = 5-fold cross-fitting. In CDR2OPL, the continuum of regression functions {f̂0(s, a; α)} was estimated according to Section 4.1, with weights ω̂i(s, a) derived from fitting a Random Forest with 25 trees. Our policies were neural network softmax policies with a hidden layer of 32 neurons and ReLU activation. For Line 10, we minimized Ŵ_DR(·, α) using Adam with a learning rate of 0.01. Following Dudík et al. (2011), we repeated each policy update ten times with perturbed starting weights and picked the best weights based on the training objective (see the simulation and policy-training sketches after this table). |
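To make the quoted data-generating process concrete, here is a minimal Python sketch of the simulation (uniform states on [-1, 1]^2, a softmax behavior policy with logits 2 s·βa built from the fifth roots of unity, and Gaussian outcomes with action-dependent noise). It is a reconstruction from the description above, not the authors' released code; names such as `sample_data` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

A = 5                                            # action space {0, ..., 4}
# beta_a = (Re zeta_a, Im zeta_a) with zeta_a = exp(2*a*pi*i / 5)
beta = np.array([[np.cos(2 * a * np.pi / A), np.sin(2 * a * np.pi / A)] for a in range(A)])
sigma = np.array([0.1, 0.2, 0.3, 0.4, 0.5])      # action-dependent outcome noise

def sample_data(n):
    # States uniform on [-1, 1]^2.
    s = rng.uniform(-1.0, 1.0, size=(n, 2))
    # Softmax behavior policy: pi0(a | s) proportional to exp(2 * s . beta_a).
    logits = 2.0 * s @ beta.T
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    a = np.array([rng.choice(A, p=p) for p in probs])
    # Outcomes: R(a) | S = s ~ N(s . beta_a, sigma_a^2).
    mean = np.einsum("ij,ij->i", s, beta[a])
    r = rng.normal(mean, sigma[a])
    return s, a, r, probs[np.arange(n), a]       # also return behavior propensities

states, actions, rewards, propensities = sample_data(1000)
```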
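The K = 5-fold cross-fitting split quoted in the Dataset Splits row can be reproduced as below; the function name `cross_fit_folds` and the seed are illustrative choices, not from the paper.

```python
import numpy as np

def cross_fit_folds(n, K=5, seed=0):
    """Randomly split indices {0, ..., n-1} into K (approximately) even folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

# Cross-fitting: nuisance models for fold k are fit on the other K-1 folds
# and evaluated on fold k.
folds = cross_fit_folds(n=1000, K=5)
for k, eval_idx in enumerate(folds):
    train_idx = np.concatenate([folds[j] for j in range(len(folds)) if j != k])
    # fit nuisance estimators on train_idx, then plug in predictions on eval_idx
```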
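Finally, the policy class and restart procedure quoted above (a softmax policy with one 32-unit ReLU hidden layer, Adam with learning rate 0.01, and ten perturbed restarts keeping the best training objective) could look roughly like the PyTorch sketch below. The paper does not name a deep-learning framework, and `objective` is only a placeholder for the DR objective Ŵ_DR minimized on Line 10 of Algorithm 2, not an implementation of it.

```python
import copy
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Softmax policy over 5 actions with one hidden layer of 32 ReLU units."""
    def __init__(self, state_dim=2, n_actions=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, s):            # s: (batch, state_dim) -> action probabilities
        return torch.softmax(self.net(s), dim=-1)

def train_with_restarts(objective, n_restarts=10, steps=500):
    """Repeat the policy update with perturbed (here: freshly randomized) starting
    weights and keep the weights with the best (lowest) training objective.
    `objective` must map a policy to a differentiable scalar loss."""
    best_policy, best_val = None, float("inf")
    for _ in range(n_restarts):
        policy = SoftmaxPolicy()
        opt = torch.optim.Adam(policy.parameters(), lr=0.01)
        for _ in range(steps):
            loss = objective(policy)
            opt.zero_grad()
            loss.backward()
            opt.step()
        val = float(objective(policy).detach())
        if val < best_val:
            best_policy, best_val = copy.deepcopy(policy), val
    return best_policy
```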