Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Efficient Randomized Experiments Using Foundation Models
Authors: Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond Duch, Fanny Yang, Issa Dahabreh
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone. [...] 5 Experiments: In this section, we first show that H-AIPW improves statistical precision across eight randomized experiments without compromising empirical coverage. |
| Researcher Affiliation | Academia | Piersilvio De Bartolomeis ETH Zurich & Harvard University EMAIL Javier Abad ETH Zurich EMAIL Guanbo Wang Harvard University EMAIL Konstantin Donhauser ETH Zurich EMAIL Raymond M Duch Oxford University EMAIL Fanny Yang ETH Zurich EMAIL Issa J Dahabreh Harvard University EMAIL |
| Pseudocode | Yes | Algorithm 1 Hybrid Augmented Inverse Probability Weighting (H-AIPW) |
| Open Source Code | No | Code and data will be made available in a public repository upon publication, with detailed instructions for reproducing all experiments. |
| Open Datasets | Yes | These studies were selected from the multidisciplinary Time-Sharing Experiments in the Social Sciences (TESS) repository, along the lines of Ashokkumar et al. [4]. For each experimental study s, we implement the following subsampling procedure: [...] Data availability: The study is publicly available at: https://tessexperiments.org/study/silverman1035 |
| Dataset Splits | Yes | For each experimental study s, we implement the following subsampling procedure: starting with a full dataset D of size N_s, we select a target sample size n. For each subsampling repetition r ∈ {1, ..., R}, we sample n participants without replacement from D, ensuring the treatment and control groups are balanced, to create a smaller dataset D_r. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts for running the experiments. It mentions various LLMs used for predictions, but not the specific hardware environment where the H-AIPW algorithm itself was implemented and evaluated. |
| Software Dependencies | No | The AIPW estimator implements cross-fitting with 30 folds, using ridge regression with regularization λ = 1.0 in the standard case and XGBoost with default hyperparameters in the boosting case. For PPCT, we follow the implementation by Poulet et al. [40], using GPT-4o's predictions for the control scenario as the prognostic score. We implement PROCOVA using an AIPW estimator whose outcome regression estimator is augmented with a smart covariate, i.e. the prediction of GPT-4o for both arms. The coefficients for the optimal combination are computed using standard Python libraries. |
| Experiment Setup | Yes | For all experiments, we first select the five features most correlated with the outcome variable. The AIPW estimator implements cross-fitting with 30 folds, using ridge regression with regularization λ = 1.0 in the standard case and XGBoost with default hyperparameters in the boosting case. |
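
The Dataset Splits and Experiment Setup rows describe a balanced subsampling step, selection of the five features most correlated with the outcome, and a cross-fitted AIPW estimator with ridge regression (λ = 1.0, 30 folds). The sketch below illustrates those three steps on synthetic data. It is not the authors' code and it is the standard AIPW estimator, not H-AIPW, which additionally combines foundation-model predictions. All function names, the synthetic data, the subsample size n = 400, and the known treatment probability `pi = 0.5` are assumptions made for illustration.

```python
import numpy as np

def balanced_subsample(treatment, n, rng):
    """Draw n indices without replacement, n // 2 from each arm."""
    treated = np.flatnonzero(treatment == 1)
    control = np.flatnonzero(treatment == 0)
    half = n // 2
    idx = np.concatenate([
        rng.choice(treated, size=half, replace=False),
        rng.choice(control, size=half, replace=False),
    ])
    return rng.permutation(idx)

def top_k_correlated(X, y, k=5):
    """Indices of the k columns with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu_x
    gram = Xc.T @ Xc + lam * np.eye(X_train.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ (y_train - mu_y))
    return mu_y + (X_test - mu_x) @ beta

def aipw_cross_fit(X, a, y, n_folds=30, lam=1.0, pi=0.5, seed=0):
    """Cross-fitted AIPW estimate of the average treatment effect.

    pi is the treatment probability, assumed known (randomized experiment).
    Returns the point estimate and its estimated standard error.
    """
    rng = np.random.default_rng(seed)
    n_obs = len(y)
    folds = np.array_split(rng.permutation(n_obs), n_folds)
    psi = np.empty(n_obs)
    for test in folds:
        train = np.setdiff1d(np.arange(n_obs), test)
        t1 = train[a[train] == 1]  # treated units in the training folds
        t0 = train[a[train] == 0]  # control units in the training folds
        m1 = ridge_fit_predict(X[t1], y[t1], X[test], lam)
        m0 = ridge_fit_predict(X[t0], y[t0], X[test], lam)
        # Outcome-model difference plus inverse-probability-weighted residuals.
        psi[test] = (m1 - m0
                     + a[test] * (y[test] - m1) / pi
                     - (1 - a[test]) * (y[test] - m0) / (1 - pi))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n_obs)

# Synthetic stand-in for one study (assumed data, true effect = 1.0).
rng = np.random.default_rng(0)
N, d = 2000, 20
X = rng.normal(size=(N, d))
a = rng.integers(0, 2, size=N)
y = 2.0 * X[:, :5].sum(axis=1) + 1.0 * a + rng.normal(size=N)

idx = balanced_subsample(a, n=400, rng=rng)       # one repetition r
selected = top_k_correlated(X[idx], y[idx], k=5)  # five most correlated features
ate, se = aipw_cross_fit(X[idx][:, selected], a[idx], y[idx])
print(f"ATE estimate: {ate:.3f} (SE {se:.3f})")
```

Repeating the last three steps for r = 1, ..., R with fresh subsamples would give the distribution of estimates over repetitions. The boosting variant mentioned in the table would swap `ridge_fit_predict` for an XGBoost regressor with default hyperparameters.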