Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning

Authors: Nathan Kallus, Xiaojie Mao, Kaiwen Wang, Zhengyuan Zhou

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our algorithms in simulations.
Researcher Affiliation | Collaboration | Cornell University and Cornell Tech; Tsinghua University; Arena Technologies and New York University.
Pseudocode | Yes | Algorithm 1: Localized Doubly Robust DROPE; Algorithm 2: Continuum Doubly Robust DROPL.
Open Source Code | Yes | Code is available at https://github.com/CausalML/doubly-robust-dropel.
Open Datasets | No | The paper describes a simulated data-generating process and does not use or provide access information for a publicly available or open dataset.
Dataset Splits | Yes | Randomly split D into K (approximately) even folds, with the indices of the kth fold denoted as I_k. All models were fitted with K = 5 fold cross-fitting. (A cross-fitting split sketch follows this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments.
Software Dependencies | No | The paper mentions using the "LightGBM package (Ke et al., 2017)" and "Adam with a learning rate of 0.01" but does not provide specific version numbers for these software components or other libraries.
Experiment Setup | Yes | The state space is two-dimensional, S = [-1, 1]^2, and states are sampled uniformly, S ~ Unif([-1, 1]^2). The action space is A = {0, 1, ..., 4}, and the behavior policy is a softmax policy π0(a | s) ∝ exp(2 s⊤βa), where the βa are the coordinates of the a-th fifth root of unity, i.e. βa = (Re ζa, Im ζa) with ζa = exp(2aπi/5). Potential outcomes are normally distributed: R(a) | S = s ~ N(s⊤βa, σa²), where σ = [0.1, 0.2, 0.3, 0.4, 0.5]. We conducted experiments under three uncertainty-set radii δ = 0.1, 0.2, 0.3, and in two settings, where the propensities π0 were known and unknown. All models were fitted with K = 5 fold cross-fitting. In CDR2OPL, the continuum of regression functions {f̂0(s, a; α)} was estimated according to Section 4.1, with weights ω̂i(s, a) derived from fitting a Random Forest with 25 trees. Our policies were neural-network softmax policies with a hidden layer of 32 neurons and ReLU activation. For Line 10, we minimized Ŵ_DR(π, α) using Adam with a learning rate of 0.01. Following Dudík et al. (2011), we repeated each policy update ten times with perturbed starting weights and picked the best weights based on the training objective. (A simulation sketch of this data-generating process follows this table.)
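
The K = 5 fold cross-fitting referenced in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code; it only assumes a logged dataset indexed 0, ..., n-1 and uses scikit-learn's KFold to produce (approximately) even folds I_k. The helper name fit_nuisance in the commented usage is a hypothetical placeholder for whatever nuisance-model fitting routine is used.

```python
# Minimal sketch (not the authors' code) of a K = 5 fold cross-fitting split.
import numpy as np
from sklearn.model_selection import KFold

def cross_fit_folds(n, K=5, seed=0):
    """Randomly split indices {0, ..., n-1} into K (approximately) even folds I_k."""
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    # Each element is (train indices = D \ I_k, held-out indices = I_k).
    return list(kf.split(np.arange(n)))

# Cross-fitting usage: fit nuisances on D \ I_k and evaluate them only on I_k.
# for train_idx, eval_idx in cross_fit_folds(n=1000):
#     model_k = fit_nuisance(train_idx)              # hypothetical fitting routine
#     nuisance_values[eval_idx] = model_k.predict(eval_idx)
```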
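
The data-generating process in the Experiment Setup row can likewise be sketched. This is a minimal simulation under the stated assumptions, not the authors' implementation: it draws uniform two-dimensional states, samples actions from the softmax behavior policy with fifth-root-of-unity coefficients βa, and draws Gaussian outcomes with the per-action noise levels σ.

```python
# Minimal sketch (not the authors' code) of the simulated data-generating process:
# S ~ Unif([-1, 1]^2), behavior policy pi_0(a | s) proportional to exp(2 s^T beta_a)
# with beta_a the a-th fifth root of unity, and R(a) | S = s ~ N(s^T beta_a, sigma_a^2).
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
zeta = np.exp(2j * np.pi * np.arange(n_actions) / n_actions)   # fifth roots of unity
beta = np.stack([zeta.real, zeta.imag], axis=1)                 # beta_a = (Re zeta_a, Im zeta_a)
sigma = np.array([0.1, 0.2, 0.3, 0.4, 0.5])                     # per-action outcome noise

def sample_logged_data(n):
    s = rng.uniform(-1.0, 1.0, size=(n, 2))                     # states in [-1, 1]^2
    logits = 2.0 * s @ beta.T                                    # 2 * s^T beta_a for each action
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                    # softmax behavior policy pi_0
    a = np.array([rng.choice(n_actions, p=p) for p in probs])   # logged actions
    r = rng.normal(np.sum(s * beta[a], axis=1), sigma[a])       # observed rewards
    return s, a, r, probs[np.arange(n), a]                       # states, actions, rewards, propensities

s, a, r, propensities = sample_logged_data(1000)
```

The returned propensities correspond to the known-π0 setting; in the unknown-propensity setting described above they would instead be estimated from the logged (s, a) pairs.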