Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation
Authors: Shengpu Tang, Jenna Wiens
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. [Section 5, Experiments] First, through a suite of simple bandit problems, we verify the theoretical properties of C-IS. Then, we apply our approach to a healthcare-inspired RL simulation domain, where we compare the performance of our proposed approach, C-PDIS, to several baselines in terms of their OPE accuracy and ability to rank policies, and explore robustness to bias, noise, and missingness in the annotations. |
| Researcher Affiliation | Academia | Shengpu Tang, Jenna Wiens; Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA |
| Pseudocode | No | The paper provides mathematical definitions and recursive formulas for the estimators (e.g., Definition 2, Definition 4), but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for all experiments is available at https://github.com/MLD3/CounterfactualAnnot-SemiOPE. |
| Open Datasets | Yes | Next, we apply our approach to evaluate policies in a simulated RL domain modeled after the physiology of sepsis patients [32]. Following prior work [22], we collected 50 offline datasets from the sepsis simulator (using different random seeds) each with 1000 episodes by following an ϵ-greedy behavior policy with respect to the optimal policy where ϵ = 0.1. |
| Dataset Splits | No | The paper describes collecting offline datasets and evaluating policies on them but does not specify explicit training, validation, and test dataset splits for the collected data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing specifications used for running experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries with their versions). |
| Experiment Setup | Yes | For the sepsis simulator: 'Following prior work [22], we collected 50 offline datasets from the sepsis simulator (using different random seeds) each with 1000 episodes by following an ϵ-greedy behavior policy with respect to the optimal policy where ϵ = 0.1.' For the bandit experiments: 'In Table 1, we display the results for R(s1, ) = 1, R(s1, ) = 2, R(s2, ) = 1, σ = 0.5.' |
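
The Experiment Setup row describes an ϵ-greedy behavior policy (ϵ = 0.1) with respect to an optimal policy, with one random seed per offline dataset. The snippet below is a minimal sketch of that idea, not the authors' code. The function names, the toy state and action counts, and the convention that the ϵ mass is spread uniformly over all actions (including the optimal one) are assumptions made for illustration; the paper's simulator and exact convention may differ.

```python
import numpy as np


def epsilon_greedy_probs(optimal_actions, n_actions, epsilon=0.1):
    """Build an (n_states, n_actions) matrix of behavior-policy probabilities.

    Assumption: the epsilon mass is spread uniformly over ALL actions,
    so the optimal action gets probability 1 - epsilon + epsilon / n_actions.
    """
    optimal_actions = np.asarray(optimal_actions)
    n_states = optimal_actions.shape[0]
    probs = np.full((n_states, n_actions), epsilon / n_actions)
    probs[np.arange(n_states), optimal_actions] += 1.0 - epsilon
    return probs


def sample_actions(probs, states, seed):
    """Sample one action per visited state, using a seed-specific generator."""
    rng = np.random.default_rng(seed)
    n_actions = probs.shape[1]
    return np.array([rng.choice(n_actions, p=probs[s]) for s in states])


# Hypothetical toy problem (not the sepsis simulator): 3 states, 4 actions.
optimal_actions = [2, 0, 3]
behavior = epsilon_greedy_probs(optimal_actions, n_actions=4, epsilon=0.1)

# One dataset per seed, mirroring the "50 datasets, different random seeds" setup.
visited_states = [0, 1, 2] * 5
datasets = [sample_actions(behavior, visited_states, seed) for seed in range(50)]
```

Keeping the behavior-policy probabilities as an explicit matrix is convenient here because importance sampling estimators need the probability of each logged action under the behavior policy.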