Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Efficient Randomized Experiments Using Foundation Models

Authors: Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond Duch, Fanny Yang, Issa Dahabreh

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone. [...] 5 Experiments: In this section, we first show that H-AIPW improves statistical precision across eight randomized experiments without compromising empirical coverage.
Researcher Affiliation | Academia | Piersilvio De Bartolomeis (ETH Zurich & Harvard University), Javier Abad (ETH Zurich), Guanbo Wang (Harvard University), Konstantin Donhauser (ETH Zurich), Raymond M Duch (Oxford University), Fanny Yang (ETH Zurich), Issa J Dahabreh (Harvard University)
Pseudocode | Yes | Algorithm 1: Hybrid Augmented Inverse Probability Weighting (H-AIPW)
Open Source Code | No | Code and data will be made available in a public repository upon publication, with detailed instructions for reproducing all experiments.
Open Datasets | Yes | These studies were selected from the multidisciplinary Time-Sharing Experiments in the Social Sciences (TESS) repository, along the lines of Ashokkumar et al. [4]. For each experimental study s, we implement the following subsampling procedure: [...] Data availability: The study is publicly available at: https://tessexperiments.org/study/silverman1035
Dataset Splits | Yes | For each experimental study s, we implement the following subsampling procedure: starting with a full dataset D of size N_s, we select a target sample size n. For each subsampling repetition r ∈ {1, ..., R}, we sample n participants without replacement from D, ensuring the treatment and control groups are balanced, to create a smaller dataset D_r.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts for running the experiments. It mentions various LLMs used for predictions, but not the specific hardware environment where the H-AIPW algorithm itself was implemented and evaluated.
Software Dependencies | No | The AIPW estimator implements cross-fitting with 30 folds, using ridge regression with regularization λ = 1.0 in the standard case and XGBoost with default hyperparameters in the boosting case. For PPCT, we follow the implementation by Poulet et al. [40], using GPT-4o's predictions for the control scenario as the prognostic score. We implement PROCOVA using an AIPW estimator whose outcome regression estimator is augmented with a smart covariate, i.e., the prediction of GPT-4o for both arms. The coefficients for the optimal combination are computed using standard Python libraries.
Experiment Setup | Yes | For all experiments, we first select the five features most correlated with the outcome variable. The AIPW estimator implements cross-fitting with 30 folds, using ridge regression with regularization λ = 1.0 in the standard case and XGBoost with default hyperparameters in the boosting case.
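
The subsampling procedure quoted under Dataset Splits can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code, which had not been released at the time of classification. The function name `balanced_subsamples`, the record layout (dicts with a boolean `treated` key), the reading of "balanced" as an exact n/2 split per arm, the fixed seed, and the toy data are all hypothetical.

```python
import random


def balanced_subsamples(dataset, n, repetitions, seed=0):
    """Draw `repetitions` subsamples of size n from `dataset`.

    Each subsample is drawn without replacement and contains n/2 treated
    and n/2 control participants.

    Assumptions (not from the paper): records are dicts with a boolean
    "treated" key, and "balanced" means an exact n/2 split per arm.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for an exact balanced split")

    treated = [row for row in dataset if row["treated"]]
    control = [row for row in dataset if not row["treated"]]
    half = n // 2
    if half > len(treated) or half > len(control):
        raise ValueError("not enough participants in one arm")

    rng = random.Random(seed)  # fixed seed so the draws can be repeated
    subsamples = []
    for _ in range(repetitions):
        # random.sample draws without replacement within each arm
        d_r = rng.sample(treated, half) + rng.sample(control, half)
        rng.shuffle(d_r)  # avoid ordering the subsample by arm
        subsamples.append(d_r)
    return subsamples


# Synthetic full dataset D with N_s = 200 participants (illustration only)
D = [{"id": i, "treated": i % 2 == 0} for i in range(200)]
subsamples = balanced_subsamples(D, n=50, repetitions=10)
```

Sampling within each arm separately is one simple way to guarantee balance in every repetition; the paper's quoted text does not say how balance is enforced, so other schemes would be equally consistent with it.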