Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes

Authors: Andrew Bennett, Nathan Kallus, Miruna Oprescu, Wen Sun, Kaiwen Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and addressing the challenge of learning from finite samples (inference) together leads to credible and reliable policy evaluation.
Researcher Affiliation | Collaboration | Andrew Bennett (Morgan Stanley, andrew.bennett@morganstanley.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Miruna Oprescu (Cornell University, amo78@cornell.edu); Wen Sun (Cornell University, ws455@cornell.edu); Kaiwen Wang (Cornell University, kw437@cornell.edu)
Pseudocode | Yes | Algorithm 1 (Robust FQE): iterative fitting for estimating Q and β_τ. Algorithm 2 (Robust MIL): minimax estimation of w with a stabilizer. Algorithm 3: orthogonal estimator for V^{d_1}.
Open Source Code | Yes | The code for our experiments is open-sourced and available at https://github.com/CausalML/adversarial-ope/.
Open Datasets | No | For the synthetic environment, the authors generated their own dataset: 'We sampled a dataset of 20,000 tuples using a different fixed logging policy π_b.' For the medical application, they generated a dataset from a simulator based on MIMIC-III: 'we generated a fixed offline dataset consisting of 20,000 observed tuples of state, action, reward, and next state.' Neither generated dataset is stated to be publicly available, and no direct access is provided.
Dataset Splits | Yes | Specifically, we used the first 10,000 tuples for estimating nuisances, and the second 10,000 tuples for the final estimators. (A split sketch appears after the table.)
Hardware Specification | No | The paper states 'our experiment is a proof of concept and can be run on a standard GPU', but it does not provide specific details such as the model, memory, or number of GPUs used.
Software Dependencies | No | The paper mentions using 'neural nets' and 'linear sieves' for implementation, and refers to 'PPO: [63]' and 'DQL: [53]' for policy training, but it does not specify any software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | The task is to estimate the worst-case policy value V^{d_1} of a fixed target policy π_t across four different constant values of the sensitivity parameter, Λ(s, a) ∈ {1, 2, 4, 8}. We (1) sampled a dataset of 20,000 tuples using a different fixed logging policy π_b; (2) fit the nuisance functions Q, β, and w following the method outlined in Algorithms 1 and 2 for each Λ; and (3) estimated the corresponding robust policy value V^{d_1} for all estimators using the fitted nuisances. (This pipeline is sketched after the table.)
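
The dataset-split row reports a straightforward half-and-half split of the 20,000 offline tuples: the first 10,000 for nuisance estimation and the second 10,000 for the final estimators. A minimal sketch of that split, assuming the tuples are stored as an ordered sequence of records; the function name and record layout are illustrative, not taken from the authors' repository.

```python
def split_offline_dataset(tuples, n_nuisance=10_000):
    """Split an ordered offline dataset of (state, action, reward, next_state)
    records into a nuisance-fitting fold and an estimation fold."""
    nuisance_fold = tuples[:n_nuisance]    # first 10,000: fit Q, beta, and w
    estimation_fold = tuples[n_nuisance:]  # second 10,000: final estimators
    return nuisance_fold, estimation_fold
```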
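The experiment-setup row describes a three-step pipeline repeated for each sensitivity level Λ. The outline below mirrors those steps as a sketch only: the callables `fit_robust_fqe`, `fit_robust_mil`, and `orthogonal_value` are hypothetical stand-ins for Algorithms 1-3, and their signatures are invented for illustration rather than drawn from the released code.

```python
from typing import Callable, Dict

# Constant sensitivity parameters Lambda(s, a) used in the paper's experiment.
LAMBDAS = (1, 2, 4, 8)

def run_experiment(
    nuisance_fold,              # first fold: used to fit the nuisances
    estimation_fold,            # second fold: used for the final estimate
    target_policy,
    fit_robust_fqe: Callable,   # stand-in for Algorithm 1: fits Q and beta_tau
    fit_robust_mil: Callable,   # stand-in for Algorithm 2: minimax fit of w
    orthogonal_value: Callable, # stand-in for Algorithm 3: orthogonal estimator
) -> Dict[float, float]:
    """Estimate the robust policy value V^{d_1} for each sensitivity level."""
    results = {}
    for lam in LAMBDAS:
        q_hat, beta_hat = fit_robust_fqe(nuisance_fold, target_policy, lam)
        w_hat = fit_robust_mil(nuisance_fold, target_policy, lam)
        # Final estimate on the held-out fold using the fitted nuisances.
        results[lam] = orthogonal_value(
            estimation_fold, q_hat, beta_hat, w_hat, target_policy
        )
    return results
```

Passing the three fitting routines in as callables keeps the sketch self-contained without asserting anything about the authors' actual interfaces.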