Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes

Authors: Andrew Bennett, Nathan Kallus, Miruna Oprescu, Wen Sun, Kaiwen Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and addressing the challenge of learning from finite samples (inference) together leads to credible and reliable policy evaluation.
Researcher Affiliation | Collaboration | Andrew Bennett (Morgan Stanley, andrew.bennett@morganstanley.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Miruna Oprescu (Cornell University, amo78@cornell.edu); Wen Sun (Cornell University, ws455@cornell.edu); Kaiwen Wang (Cornell University, kw437@cornell.edu)
Pseudocode | Yes | Algorithm 1 (Robust FQE): iterative fitting for estimating Q and β_τ. Algorithm 2 (Robust MIL): minimax estimation of w with a stabilizer. Algorithm 3: orthogonal estimator for V^{d_1}.
Open Source Code | Yes | The code for our experiments is open-sourced and available at https://github.com/CausalML/adversarial-ope/.
Open Datasets | No | For the synthetic environment, the authors generated their own dataset: 'We sampled a dataset of 20,000 tuples using a different fixed logging policy π_b.' For the medical application, they generated a dataset from a simulator based on MIMIC-III: 'we generated a fixed offline dataset consisting of 20,000 observed tuples of state, action, reward, and next state.' Neither generated dataset is stated to be publicly available, and no direct access is provided.
Dataset Splits | Yes | Specifically, we used the first 10,000 tuples for estimating nuisances, and the second 10,000 tuples for the final estimators. (A split sketch appears after the table.)
Hardware Specification | No | The paper states 'our experiment is a proof of concept and can be run on a standard GPU', but it does not provide specific details such as the model, memory, or number of GPUs used.
Software Dependencies | No | The paper mentions using 'neural nets' and 'linear sieves' for implementation, and refers to 'PPO: [63]' and 'DQL: [53]' for policy training, but it does not specify any software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | The task is to estimate the worst-case policy value V^{d_1} of a fixed target policy π_t across four different constant values of the sensitivity parameter, Λ(s, a) ∈ {1, 2, 4, 8}. We (1) sampled a dataset of 20,000 tuples using a different fixed logging policy π_b; (2) fit the nuisance functions Q, β, and w following the method outlined in Algorithms 1 and 2 for each Λ; and (3) estimated the corresponding robust policy value V^{d_1} for all estimators using the fitted nuisances. (This pipeline is sketched after the table.)
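
The dataset-split row reports a straightforward half-and-half split of the 20,000 offline tuples: the first 10,000 for nuisance estimation and the second 10,000 for the final estimators. A minimal sketch of that split, assuming the tuples are stored as an ordered sequence of records; the function name and record layout are illustrative, not taken from the authors' repository.

```python
def split_offline_dataset(tuples, n_nuisance=10_000):
    """Split an ordered offline dataset of (state, action, reward, next_state)
    records into a nuisance-fitting fold and an estimation fold."""
    nuisance_fold = tuples[:n_nuisance]    # first 10,000: fit Q, beta, and w
    estimation_fold = tuples[n_nuisance:]  # second 10,000: final estimators
    return nuisance_fold, estimation_fold
```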
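The experiment-setup row describes a three-step pipeline repeated for each sensitivity level Λ. The outline below mirrors those steps as a sketch only: the callables `fit_robust_fqe`, `fit_robust_mil`, and `orthogonal_value` are hypothetical stand-ins for Algorithms 1-3, and their signatures are invented for illustration rather than drawn from the released code.

```python
from typing import Callable, Dict

# Constant sensitivity parameters Lambda(s, a) used in the paper's experiment.
LAMBDAS = (1, 2, 4, 8)

def run_experiment(
    nuisance_fold,              # first fold: used to fit the nuisances
    estimation_fold,            # second fold: used for the final estimate
    target_policy,
    fit_robust_fqe: Callable,   # stand-in for Algorithm 1: fits Q and beta_tau
    fit_robust_mil: Callable,   # stand-in for Algorithm 2: minimax fit of w
    orthogonal_value: Callable, # stand-in for Algorithm 3: orthogonal estimator
) -> Dict[float, float]:
    """Estimate the robust policy value V^{d_1} for each sensitivity level."""
    results = {}
    for lam in LAMBDAS:
        q_hat, beta_hat = fit_robust_fqe(nuisance_fold, target_policy, lam)
        w_hat = fit_robust_mil(nuisance_fold, target_policy, lam)
        # Final estimate on the held-out fold using the fitted nuisances.
        results[lam] = orthogonal_value(
            estimation_fold, q_hat, beta_hat, w_hat, target_policy
        )
    return results
```

Passing the three fitting routines in as callables keeps the sketch self-contained without asserting anything about the authors' actual interfaces.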