Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions
Authors: Omer Gottesman, Joseph Futoma, Yao Liu, Sonali Parbhoo, Leo Celi, Emma Brunskill, Finale Doshi-Velez
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on medical simulations and real-world intensive care unit data demonstrate that our method can be used to identify limitations in the evaluation process and make evaluation more robust. |
| Researcher Affiliation | Academia | ¹Harvard University, ²Stanford University, ³MIT. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for reproducing the results in this paper can be found at https://github.com/dtak/interpretable_ope_public.git |
| Open Datasets | Yes | Our data source is a subset of the publicly available MIMIC-III dataset (Johnson et al., 2016). |
| Dataset Splits | No | Our final dataset consists of 346 patient trajectories (6777 transitions) for learning a policy and another 346 trajectories (6863 transitions) for evaluation of the policy via OPE and influence analysis. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In all figures, we highlight in red all influential transitions our method would have highlighted for review by domain experts (I_c = 0.05). As an evaluation policy, we use the most common action of a state's 50 nearest neighbors. |
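
The experiment setup quoted above can be illustrated with a minimal sketch. It is not the authors' code (see the linked repository for that). It assumes Euclidean distance between state vectors, breaks ties toward the smallest action label, and treats influence scores as precomputed inputs that are compared to the threshold by absolute value. The function names `knn_majority_action` and `flag_influential` are hypothetical.

```python
import numpy as np


def knn_majority_action(query_state, states, actions, k=50):
    """Return the most common action among the k nearest stored states.

    Assumption: Euclidean distance; the source does not state the metric.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions)
    query = np.asarray(query_state, dtype=float)

    # Distance from the query state to every stored state.
    dists = np.linalg.norm(states - query, axis=1)

    # Indices of the k closest states (fewer if the dataset is smaller than k).
    k = min(k, len(states))
    nearest = np.argsort(dists, kind="stable")[:k]

    # Majority vote over the neighbors' actions; ties go to the smallest label.
    values, counts = np.unique(actions[nearest], return_counts=True)
    return values[np.argmax(counts)]


def flag_influential(influences, threshold=0.05):
    """Return indices of transitions whose |influence| exceeds the threshold.

    Assumption: influence scores are already computed elsewhere.
    """
    influences = np.asarray(influences, dtype=float)
    return np.flatnonzero(np.abs(influences) > threshold)
```

With the default arguments, `k=50` and `threshold=0.05` match the values quoted in the table; everything else in the sketch is an illustrative choice.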