Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Data-driven Design of Randomized Control Trials with Guaranteed Treatment Effects

Authors: Santiago Cortes-Gomez, Naveen Janaki Raman, Aarti Singh, Bryan Wilder

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We assess our two-stage RCT design with both synthetic and real-world datasets. Synthetic Dataset and Setup: We construct a synthetic dataset to evaluate our two-stage RCT designs. We sample arm means, µ, from a uniform 0-1 distribution (we experiment with other choices in Appendix E). We compare our two-stage design against baselines and find that our sample splitting methods improve upon baselines. In Figure 1, we find that our sample splitting methods outperform single-stage methods across first-stage percentages.
Researcher Affiliation	Academia	Department of Machine Learning, Carnegie Mellon University. Correspondence to: Santiago Cortes Gomez <EMAIL>.
Pseudocode	Yes	Algorithm 1: Sample splitting design. 1: Input: s1 iid samples. 2: Output: Set π(X). 3: Split first-stage data randomly into two sets: U = {x1, ..., x_{s1/2}} and V = {z1, ..., z_{s1/2}}.
Open Source Code	No	We include all code and datasets at hidden. Explanation: The paper states "We include all code and datasets at hidden". The word "hidden" is a placeholder typically used during double-blind review, so the statement does not provide concrete access to the code.
Open Datasets	No	We run semi-synthetic experiments where effect sizes are drawn according to a real-world distribution drawn from a meta-analysis of treatments in gerontology (Greising et al., 2009). Explanation: The paper uses the meta-analysis by Greising et al. (2009) as a source of effect sizes for generating a semi-synthetic dataset; it does not use the meta-analysis itself as a direct, publicly available dataset for experiments. It also mentions a 'synthetic dataset', which is generated by the authors. No concrete access information (link, DOI, specific repository) is provided for any dataset used in the experiments.
Dataset Splits	No	We sample arm means, µ, from a uniform 0-1 distribution (we experiment with other choices in Appendix E). Arms have Bernoulli outcomes with mean µi, which simulates settings where treatments are successful with probability µi. We fix n = 10 (we find similar results for other n in Appendix D) and δ = 0.1 (and find similar results for other δ in Appendix B). Explanation: The paper describes generating synthetic data and running semi-synthetic experiments. It does not mention traditional dataset splits such as training, validation, or test sets with specific percentages or counts. The budgets s1 and s2 refer to sample allocation across the two stages, not to fixed dataset splits.
Hardware Specification	No	Explanation: The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies	No	Explanation: The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup	Yes	We sample arm means, µ, from a uniform 0-1 distribution (we experiment with other choices in Appendix E). Arms have Bernoulli outcomes with mean µi, which simulates settings where treatments are successful with probability µi. We fix n = 10 (we find similar results for other n in Appendix D) and δ = 0.1 (and find similar results for other δ in Appendix B). We compare the following RCT designs: (1) Random: two-stage top-k design with random k; (2) Best Arm: two-stage design with k = 1; (3) Single-stage; (4) Sample Split: our proposed two-stage method, which uses the first stage to prune arms and the second stage to compute certificates; (5) Omniscient: a two-stage method which computes k with knowledge of µ.
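
The synthetic setup quoted in the table (arm means drawn from a uniform 0-1 distribution, Bernoulli outcomes, n = 10) and the random split in step 3 of Algorithm 1 can be illustrated with a minimal sketch. This is not the authors' code, which is not available. The function names, the first-stage budget s1 = 40, the fixed seed, and the choice to treat one sample as one Bernoulli draw from every arm are all assumptions made for illustration.

```python
import random


def sample_arm_means(n_arms, rng):
    """Draw each arm mean mu_i independently from Uniform(0, 1)."""
    return [rng.random() for _ in range(n_arms)]


def draw_sample(mu, rng):
    """One sample: a Bernoulli(mu_i) outcome for every arm (assumed layout)."""
    return [1 if rng.random() < m else 0 for m in mu]


def split_in_half(samples, rng):
    """Randomly split samples into two equal halves U and V.

    If the number of samples is odd, one sample is left out (an assumption).
    """
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]


rng = random.Random(0)  # fixed seed, chosen only for repeatability
n = 10                  # number of arms, as in the quoted setup
s1 = 40                 # hypothetical first-stage budget (not from the paper)

mu = sample_arm_means(n, rng)
first_stage = [draw_sample(mu, rng) for _ in range(s1)]
U, V = split_in_half(first_stage, rng)
print(len(U), len(V))
```

With s1 = 40, each half holds s1/2 = 20 samples. How the paper uses U and V to prune arms and compute certificates is not described in the quoted text, so the sketch stops at the split.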