Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Universal Off-Policy Evaluation
Authors: Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide empirical support for the established theoretical results for the proposed UnO estimator and high-confidence bounds. To do so, we use the following domains: (1) An open source implementation [102] of the FDA-approved type-1 diabetes treatment simulator [59], (2) A stationary and a non-stationary recommender system domain, and (3) A continuous-state Gridworld with partial observability, where data is collected using multiple behavior policies. |
| Researcher Affiliation | Academia | Yash Chandak University of Massachusetts Scott Niekum University of Texas Austin Bruno Castro da Silva University of Massachusetts Erik Learned-Miller University of Massachusetts Emma Brunskill Stanford University Philip S. Thomas University of Massachusetts |
| Pseudocode | Yes | This procedure is outlined in Algorithm 1 in Appendix E.4. |
| Open Source Code | Yes | code for the proposed UnO method(s) and the domains used for empirical studies are available https://github.com/yashchandak/UnO. |
| Open Datasets | Yes | To do so, we use the following domains: (1) An open source implementation [102] of the FDA-approved type-1 diabetes treatment simulator [59], (2) A stationary and a non-stationary recommender system domain, and (3) A continuous-state Gridworld with partial observability, where data is collected using multiple behavior policies. [103] J. Xie. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose. |
| Dataset Splits | No | The paper mentions collecting data (e.g., '3 104.5 samples') but does not specify explicit training, validation, or test dataset splits with percentages, absolute counts, or predefined split methods. |
| Hardware Specification | Yes | All the experiments were conducted on a personal computer with 32 GiB of memory and an Intel Core i7 CPU with 12 threads. |
| Software Dependencies | No | Our code uses Julia [11], Blackboxoptim library [34], Python [97], and PyTorch [68]. |
| Experiment Setup | Yes | Bounds were obtained for a failure rate δ = 0.05. |