Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Efficient Policy Evaluation Across Multiple Different Experimental Datasets

Authors: Yonghan Jung, Alexis Bellot

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically verified the robustness of estimators through simulations. In this section, we demonstrate the proposed estimators in Defs. (5,7) for combining multiple experimental datasets from different domains. We first compared the estimators on synthetic data to provide evidence of the fast convergence and doubly robustness behaviours of the proposed estimators. We conclude with an analysis of the ACTG 175 clinical trial [21] and Project STAR.
Researcher Affiliation	Collaboration	Yonghan Jung, Purdue University; Alexis Bellot, Independent Researcher (now at Google DeepMind).
Pseudocode	Yes	Definition 5 (DML for combining two experiments). Let $D^2 \sim P^2_{\pi^2}(V)$, $D^1 \sim P^1_{\pi^1}(V)$ and $D^0 \sim P^0(C)$. Let $L \geq 2$ denote a fixed number. 1. Sample split: For $\ell = 1, \dots, L$, randomly split $D^i$ for $i \in \{0, 1, 2\}$ into $L$ folds. The $\ell$-th partition of the sample is denoted $D^i_{\ell}$. The complement is $D^i_{-\ell} := D^i \setminus D^i_{\ell}$. 2. Nuisance estimation: For each $\ell = 1, \dots, L$, learn the estimators $\hat{\mu}^2_{\ell}$ and $\hat{\mu}^1_{\ell}$ for $\mu^2_0, \mu^1_0$ using samples $D^2_{-\ell}, D^1_{-\ell}$, respectively. Also, learn the estimators $\hat{\omega}^1_{\ell}, \hat{\omega}^2_{\ell}$ for $\omega^1_0, \omega^2_0$ using samples $D^i_{-\ell}$ for $i = 0, 1, 2$, respectively. 3. Evaluation: The DML estimator $\hat{\psi}$ for $\mathbb{E}_{P^0_{\pi^0}}[Y]$ is then given as
Open Source Code	Yes	Codes corresponding to simulations are submitted as supplementary materials. [NeurIPS Checklist Q5 Justification]: The code will not be open sourced at this moment but we believe to have provided sufficient details to reproduce our results.
Open Datasets	Yes	We conclude with an analysis of the ACTG 175 clinical trial [21] and Project STAR. The dataset is publicly accessible from the R data repository: https://search.r-project.org/CRAN/refmans/AER/html/STAR.html.
Dataset Splits	Yes	Definition 5 (DML for combining two experiments). 1. Sample split: For $\ell = 1, \dots, L$, randomly split $D^i$ for $i \in \{0, 1, 2\}$ into $L$ folds. The $\ell$-th partition of the sample is denoted $D^i_{\ell}$. The complement is $D^i_{-\ell} := D^i \setminus D^i_{\ell}$.
Hardware Specification	No	The paper does not provide specific details about the hardware used, such as CPU or GPU models, or memory specifications.
Software Dependencies	No	The paper mentions using "XGBoost [12] to estimate nuisances" but does not specify a version number for XGBoost or any other software dependencies.
Experiment Setup	Yes	We ran 100 simulations for each $N \in \{2500, 5000, 10000, 20000\}$ where $N$ is the sample size. To enforce the convergence rate of nuisance estimates no faster than the decaying rate $n^{-1/4}$, we add $\epsilon$ to all nuisance estimates. This scenario is inspired by the experimental design discussed in [27]. The AE plots for combining two/multiple experiments are presented in Figs. (3a, 3b).
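
The sample-split and nuisance-estimation steps quoted in the Pseudocode and Dataset Splits rows follow the usual cross-fitting pattern: partition each dataset into L folds, fit a nuisance model on the complement of a fold, and evaluate it on the held-out fold. The sketch below illustrates only that pattern. It is not the authors' code: the function names `split_folds` and `cross_fit_predictions` are hypothetical, a plain least-squares fit stands in for the XGBoost nuisance learner mentioned in the paper, and the final estimator formula is omitted because it is not reproduced on this page.

```python
import numpy as np


def split_folds(n, L, seed=0):
    """Randomly partition indices 0..n-1 into L folds.

    Returns a list of (fold, complement) index arrays, one pair per fold.
    """
    if L < 2:
        raise ValueError("L must be at least 2")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    pairs = []
    for fold in np.array_split(perm, L):
        # The complement is every index that is not in the current fold.
        complement = np.setdiff1d(perm, fold)
        pairs.append((np.sort(fold), complement))
    return pairs


def cross_fit_predictions(X, y, L=2, seed=0):
    """Fit a placeholder linear nuisance on each complement and
    predict on the corresponding held-out fold (assumed stand-in learner)."""
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])  # add an intercept column
    preds = np.empty(n)
    for fold, comp in split_folds(n, L, seed):
        coef, *_ = np.linalg.lstsq(Xb[comp], y[comp], rcond=None)
        preds[fold] = Xb[fold] @ coef
    return preds
```

Under these assumptions, calling `cross_fit_predictions(X, y, L=4)` returns one out-of-fold prediction per sample, so no prediction is made by a model that saw that sample during fitting. Any regressor with fit and predict steps could replace the least-squares placeholder.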