Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Optimal Off-Policy Evaluation from Multiple Logging Policies
Authors: Nathan Kallus, Yuta Saito, Masatoshi Uehara
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the benefits of our methods efficiently leveraging of the stratified sampling of off-policy data from multiple loggers. |
| Researcher Affiliation | Academia | 1Cornell University, NY, USA. Correspondence to: Masatoshi Uehara <EMAIL> |
| Pseudocode | Yes | Algorithm 1 Feasible Cross-Fold Version of Γ(D; h, g) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | We evaluate our estimators using multiclass classification datasets from the UCI repository. Here we consider the optdigits and pendigits datasets (see Table 3 in Appendix E.). |
| Dataset Splits | Yes | We split the original data into training (30%) and evaluation (70%) sets. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware used to run the experiments (e.g., GPU/CPU models, cloud instances). |
| Software Dependencies | No | The paper mentions "We use tensorflow" but does not provide any version numbers for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | We split the original data into training (30%) and evaluation (70%) sets. ... We vary ρ1/(1 − ρ1) = n1/n2 in {0.1, 0.25, 0.5, 1, 2, 4, 10}. ... We repeat the process M = 200 times with different random seeds ... For all estimators, we estimate the logging policies using logistic regression on the evaluation set with 2-fold cross-fitting as in Algorithm 1. ... For DR, DR-Avg, and DR-PW, we construct q-estimates using logistic regression again using 2-fold cross-fitting as in Algorithm 1. |
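
The bookkeeping in the Experiment Setup row can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code (none is released): the function names `stratum_sizes`, `train_eval_split`, and `two_folds` are hypothetical, the 30%/70% split is assumed to be a random shuffle, and the two logger sample sizes are assumed to be derived from a fixed total n using ρ1/(1 − ρ1) = n1/n2. The logistic-regression fits themselves are left out.

```python
import random

# Values of n1/n2 quoted in the Experiment Setup row.
RATIOS = [0.1, 0.25, 0.5, 1, 2, 4, 10]


def stratum_sizes(n, ratio):
    """Split n samples between two loggers so that n1/n2 is close to `ratio`.

    Solving rho1 / (1 - rho1) = ratio gives rho1 = ratio / (1 + ratio).
    """
    rho1 = ratio / (1.0 + ratio)
    n1 = round(rho1 * n)
    return n1, n - n1


def train_eval_split(n_rows, train_frac=0.3, seed=0):
    """Shuffle row indices and cut them into training and evaluation lists.

    Assumption: the split is random; the table only states the proportions.
    """
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(train_frac * n_rows)
    return idx[:cut], idx[cut:]


def two_folds(indices):
    """Yield (fit, predict) index pairs for 2-fold cross-fitting.

    A nuisance model is fit on one half and used to predict on the other,
    then the roles are swapped.
    """
    half = len(indices) // 2
    first, second = indices[:half], indices[half:]
    yield first, second
    yield second, first
```

In a full replication one would loop over `RATIOS` and over M = 200 seeds, calling `train_eval_split` with a different seed each time and fitting the logging-policy and q-function models inside the `two_folds` loop.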