Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Simulation-Based Inference for Adaptive Experiments
Authors: Brian Cho, Aurelien Bibaut, Nathan Kallus
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that our approach achieves the desired coverage while reducing confidence interval widths by up to 50%, with drastic improvements for arms not targeted by the design. Our empirical results demonstrate the key benefits of our simulation-based approach: across both synthetic and real-world data, our simulation-based confidence intervals tend to produce smaller confidence intervals, while maintaining similar coverage (e.g. type I error) as existing approaches. |
| Researcher Affiliation | Collaboration | Brian M Cho Cornell Tech EMAIL Aurélien Bibaut Netflix EMAIL Nathan Kallus Netflix & Cornell Tech EMAIL |
| Pseudocode | Yes | Algorithm 1 Trajectory Simulation |
| Open Source Code | Yes | All code necessary to run the experiment is provided in the supplementary material. |
| Open Datasets | Yes | To assess the performance of our approach on real-world data, we reanalyze the results of an adaptive experiment run by Offer-Westort et al. [23]. |
| Dataset Splits | No | The paper analyzes data sequences generated by interactions between an environment and an experimenter in the multi-armed bandit setting. For our confidence intervals, we test a grid of 100 null values evenly spaced between [0, 1], the range of mean values, with B = 200 simulations per mean value. The paper does not provide explicit training/test/validation splits as typically understood for fixed datasets, instead focusing on adaptive data collection. |
| Hardware Specification | Yes | All computational results in both the main body and the appendix were run locally on a 14-inch MacBook Pro with an Apple M2 Pro chip and 16GB of memory. |
| Software Dependencies | No | The paper mentions 'Python' was used for formatting figures, but does not specify its version or any other software dependencies (libraries, frameworks, solvers) with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | For all experiments, we set type I error rate α = 0.1, and set ϵ = log log N_T(a) / √(N_T(a)), which satisfies the conditions of Theorem 1. We provide additional details, including runtime, baseline pseudocode, and results for alternative setups, in Appendix C. Synthetic Experiment Setup. For the synthetic experiments, we set K = 3 and set our target parameter to be the mean of arm 1. All arms are distributed according to a Bernoulli distribution with mean vectors µ = [0.45, 0.5, 0.55], [0.5, 0.5, 0.5], and [0.55, 0.5, 0.45] corresponding to the worst arm, no signal, and best arm settings respectively. For our confidence intervals, we test a grid of 100 null values evenly spaced between [0, 1], the range of mean values, with B = 200 simulations per mean value. Real-World Data. For each confidence interval based on our simulation procedure, we test a grid of 200 null values evenly spaced between [0, 1], with B = 1000 simulations per null value, and use the same heuristic as above for our point estimate. |
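
The quoted experiment setup can be collected into a small configuration sketch. This is not the authors' code: the names `null_grid`, `epsilon`, and the constants below are hypothetical, the grid is assumed to include both endpoints of [0, 1], and N_T(a) is assumed to denote the number of times arm a was pulled by the end of the experiment. Only the numeric values come from the table above.

```python
import math

# Values quoted in the "Experiment Setup" row.
ALPHA = 0.1                      # type I error rate
SYNTHETIC = {"n_nulls": 100, "n_sims": 200}   # grid size, B per null value
REAL_WORLD = {"n_nulls": 200, "n_sims": 1000}

# Bernoulli mean vectors for K = 3 arms; the target is the mean of arm 1.
MEAN_VECTORS = {
    "worst arm": [0.45, 0.5, 0.55],
    "no signal": [0.5, 0.5, 0.5],
    "best arm": [0.55, 0.5, 0.45],
}


def null_grid(n_points):
    """Evenly spaced candidate means on [0, 1].

    Including both endpoints is an assumption; the source only says
    "evenly spaced between [0, 1]".
    """
    return [i / (n_points - 1) for i in range(n_points)]


def epsilon(n_pulls):
    """eps = log(log(N)) / sqrt(N), with N assumed to be the arm's pull count.

    log(log(N)) is only positive for N > e, so require N >= 3.
    """
    if n_pulls < 3:
        raise ValueError("n_pulls must be at least 3 for a positive epsilon")
    return math.log(math.log(n_pulls)) / math.sqrt(n_pulls)


def total_simulations(setup):
    """Number of simulated trajectories needed for one confidence interval."""
    return setup["n_nulls"] * setup["n_sims"]
```

Under these assumptions, one synthetic confidence interval requires 100 × 200 = 20,000 simulated trajectories and one real-world interval requires 200 × 1000 = 200,000, which gives a rough sense of the compute involved on the laptop reported in the hardware row.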