Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation
Authors: Shengpu Tang, Jenna Wiens
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. (Section 5, Experiments) First, through a suite of simple bandit problems, we verify the theoretical properties of C-IS. Then, we apply our approach to a healthcare-inspired RL simulation domain, where we compare the performance of our proposed approach, C-PDIS, to several baselines in terms of their OPE accuracy and ability to rank policies, and explore robustness to bias, noise, and missingness in the annotations. |
| Researcher Affiliation | Academia | Shengpu Tang, Jenna Wiens Computer Science & Engineering University of Michigan, Ann Arbor, MI, USA {tangsp,wiensj}@umich.edu |
| Pseudocode | No | The paper provides mathematical definitions and recursive formulas for the estimators (e.g., Definition 2, Definition 4), but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for all experiments is available at https://github.com/MLD3/CounterfactualAnnot-SemiOPE. |
| Open Datasets | Yes | Next, we apply our approach to evaluate policies in a simulated RL domain modeled after the physiology of sepsis patients [32]. Following prior work [22], we collected 50 offline datasets from the sepsis simulator (using different random seeds) each with 1000 episodes by following an ϵ-greedy behavior policy with respect to the optimal policy where ϵ = 0.1. |
| Dataset Splits | No | The paper describes collecting offline datasets and evaluating policies on them but does not specify explicit training, validation, and test dataset splits for the collected data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing specifications used for running experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries with their versions). |
| Experiment Setup | Yes | For the sepsis simulator: 'Following prior work [22], we collected 50 offline datasets from the sepsis simulator (using different random seeds) each with 1000 episodes by following an ϵ-greedy behavior policy with respect to the optimal policy where ϵ = 0.1.' For the bandit experiments: 'In Table 1, we display the results for R(s1, ) = 1, R(s1, ) = 2, R(s2, ) = 1, σ = 0.5.' |
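
The data-collection setup quoted in the Experiment Setup row (50 datasets, 1000 episodes each, an ϵ-greedy behavior policy with ϵ = 0.1, one random seed per dataset) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code. The two-state `toy_step` environment, the `toy_optimal_policy` table, the horizon of 20, and all function names are hypothetical stand-ins for the sepsis simulator and its optimal policy. It also assumes one common ϵ-greedy convention, where the exploration probability ϵ is spread uniformly over all actions including the greedy one; the paper may define it differently.

```python
import numpy as np

EPSILON = 0.1      # from the quoted setup
N_DATASETS = 50    # from the quoted setup
N_EPISODES = 1000  # from the quoted setup


def epsilon_greedy_probs(greedy_action, n_actions, epsilon=EPSILON):
    """Behavior-policy action probabilities (assumed convention:
    epsilon is spread uniformly over all actions, greedy one included)."""
    probs = np.full(n_actions, epsilon / n_actions)
    probs[greedy_action] += 1.0 - epsilon
    return probs


def toy_step(state, action, rng):
    """Hypothetical 2-state environment standing in for the simulator."""
    reward = float(action == state)   # reward 1 when the action matches the state
    next_state = int(rng.integers(2))
    done = bool(rng.random() < 0.2)   # random termination
    return next_state, reward, done


def collect_dataset(optimal_policy, step_fn, n_actions, seed,
                    n_episodes=N_EPISODES, horizon=20, epsilon=EPSILON):
    """Roll out the epsilon-greedy behavior policy, logging for every step
    (state, action, reward, behavior probability of the logged action)."""
    rng = np.random.default_rng(seed)
    episodes = []
    for _ in range(n_episodes):
        state, trajectory = 0, []
        for _ in range(horizon):
            greedy = int(optimal_policy[state])
            probs = epsilon_greedy_probs(greedy, n_actions, epsilon)
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = greedy
            next_state, reward, done = step_fn(state, action, rng)
            trajectory.append((state, action, reward, float(probs[action])))
            state = next_state
            if done:
                break
        episodes.append(trajectory)
    return episodes


# Hypothetical optimal policy for the toy environment: action = state.
toy_optimal_policy = np.array([0, 1])

# One dataset per random seed, as in the quoted setup.
datasets = [
    collect_dataset(toy_optimal_policy, toy_step, n_actions=2, seed=seed)
    for seed in range(N_DATASETS)
]
```

Logging the behavior probability of each taken action alongside the transition is a design choice made here because importance sampling estimators need that probability as the denominator of the importance ratio.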