Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes
Authors: Andrew Bennett, Nathan Kallus, Miruna Oprescu, Wen Sun, Kaiwen Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and addressing the challenge of learning from finite samples (inference) together leads to credible and reliable policy evaluation. |
| Researcher Affiliation | Collaboration | Andrew Bennett (Morgan Stanley) andrew.bennett@morganstanley.com; Nathan Kallus (Cornell University) kallus@cornell.edu; Miruna Oprescu (Cornell University) amo78@cornell.edu; Wen Sun (Cornell University) ws455@cornell.edu; Kaiwen Wang (Cornell University) kw437@cornell.edu |
| Pseudocode | Yes | Algorithm 1 (Robust FQE): Iterative fitting for estimating Q and β_τ. Algorithm 2 (Robust MIL): Minimax estimation of w with a stabilizer. Algorithm 3: Orthogonal estimator for V_{d_1}. (A generic fitted-Q sketch is given after the table for orientation.) |
| Open Source Code | Yes | The code for our experiments is open-sourced and available at https://github.com/CausalML/adversarial-ope/. |
| Open Datasets | No | For the synthetic environment, the authors generated their own dataset: 'We sampled a dataset of 20,000 tuples using a different fixed logging policy πb'. For the medical application, they generated a dataset from a simulator based on MIMIC-III: 'we generated a fixed offline dataset consisting of 20,000 observed tuples of state, action, reward, and next state.' No direct access or statement of public availability is provided for these specific generated datasets. |
| Dataset Splits | Yes | Specifically, we used the first 10,000 tuples for estimating nuisances, and the second 10,000 tuples for the final estimators. |
| Hardware Specification | No | The paper states 'our experiment is a proof of concept and can be run on a standard GPU', but it does not provide specific details such as the model, memory, or number of GPUs used. |
| Software Dependencies | No | The paper mentions using 'neural nets' and 'linear sieves' for implementation, and refers to 'PPO: [63]' and 'DQL: [53]' for policy training, but it does not specify any software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | The task is to estimate the worst-case policy value V_{d_1} of a fixed target policy π_t, across four different constant values of the sensitivity parameter: Λ(s, a) ∈ {1, 2, 4, 8}. (1) We sampled a dataset of 20,000 tuples using a different fixed logging policy π_b; (2) fit the nuisance functions Q, β, and w following the method outlined in Algorithms 1 and 2 for each Λ; and (3) estimated the corresponding robust policy value V_{d_1} for all estimators using the fitted nuisances. (A schematic version of this pipeline is sketched after the table.) |
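For orientation, the experiment setup and sample split reported above can be read as a three-step pipeline: sample offline data under the logging policy, fit the nuisances on the first half, and form the robust value estimate on the second half for each Λ. The sketch below is a minimal, hypothetical rendering of that pipeline, not the authors' released code: the dataset is a random toy stand-in, and `fit_q_beta`, `fit_w`, and `orthogonal_estimate` are placeholder names standing in for Algorithms 1–3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 20,000 logged (s, a, r, s') tuples collected under the
# logging policy pi_b; the paper instead uses its synthetic and MIMIC-III-based
# simulators.
n, state_dim = 20_000, 4
dataset = {
    "s": rng.normal(size=(n, state_dim)),
    "a": rng.integers(0, 2, size=n),
    "r": rng.normal(size=n),
    "s_next": rng.normal(size=(n, state_dim)),
}

def take(data, sl):
    """Slice every field of the dataset with the same index range."""
    return {k: v[sl] for k, v in data.items()}

# Sample split reported in the paper: first 10,000 tuples for nuisance
# estimation, second 10,000 for the final estimators.
nuisance_data = take(dataset, slice(0, 10_000))
estimation_data = take(dataset, slice(10_000, None))

# Hypothetical placeholders for Algorithms 1-3; they return trivial dummy
# nuisances here so the sketch runs end to end.
def fit_q_beta(data, Lam):                      # Algorithm 1 (Robust FQE)
    return (lambda s, a: np.zeros(len(s)),      # dummy Q-function estimate
            lambda s, a: np.ones(len(s)))       # dummy beta estimate

def fit_w(data, Lam):                           # Algorithm 2 (Robust MIL)
    return lambda s, a: np.ones(len(s))         # dummy density-ratio estimate

def orthogonal_estimate(data, q, beta, w):      # Algorithm 3 (orthogonal estimator)
    return float(np.mean(w(data["s"], data["a"]) * data["r"]))  # dummy plug-in

# Repeat the evaluation for each constant sensitivity parameter Lambda.
estimates = {}
for Lam in (1, 2, 4, 8):
    q_hat, beta_hat = fit_q_beta(nuisance_data, Lam)
    w_hat = fit_w(nuisance_data, Lam)
    estimates[Lam] = orthogonal_estimate(estimation_data, q_hat, beta_hat, w_hat)

print(estimates)
```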
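The Algorithm 1 entry in the Pseudocode row describes a fitted-Q-style iteration for the nuisances. As background on that structure, below is a generic (non-robust) fitted Q-evaluation loop with an off-the-shelf regressor; it is a sketch under assumed interfaces (a `pi_e` probability function and a discrete action space), not the paper's robust variant, which per the table also fits the threshold nuisance β_τ and depends on the sensitivity parameter Λ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_evaluation(s, a, r, s_next, pi_e, n_actions, gamma=0.99, n_iters=50):
    """Generic FQE sketch: iteratively regress r + gamma * E_{a'~pi_e}[Q(s', a')] on (s, a).

    pi_e(states, action) is assumed to return the target policy's probability of
    taking `action` in each row of `states`.
    """
    n = len(r)
    q_next = np.zeros((n, n_actions))  # current estimates of Q(s', a') for all a'
    model = None
    for _ in range(n_iters):
        # Bellman target under the target policy pi_e.
        v_next = sum(pi_e(s_next, ap) * q_next[:, ap] for ap in range(n_actions))
        target = r + gamma * v_next
        # Refit Q on (s, a) against the current target.
        model = GradientBoostingRegressor().fit(np.column_stack([s, a]), target)
        # Refresh Q(s', a') for every action for the next iteration.
        q_next = np.column_stack([
            model.predict(np.column_stack([s_next, np.full(n, ap)]))
            for ap in range(n_actions)
        ])
    return model
```

The loop alternates between forming regression targets under the target policy and refitting the regressor, which is the "iterative fitting" structure that the table's Algorithm 1 caption refers to.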