Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bernstein–von Mises for Adaptively Collected Data

Authors: Kevin Du, Yash Nair, Lucas Janson

NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 25,716 (total tokens sent to the LLM as input for this paper's analysis) | Output Tokens: 3,418 (total tokens produced by the LLM, including reasoning/thinking tokens, for this paper's analysis)

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate our theory (positive and negative) via a range of simulations.
Researcher Affiliation	Academia	Kevin Du Department of Statistics Harvard University EMAIL Yash Nair Department of Statistics Stanford University EMAIL Lucas Janson Department of Statistics Harvard University EMAIL
Pseudocode	Yes	Algorithm 1 Adaptive Linear Gaussian Sampling Procedure
Open Source Code	Yes	The code for this project can be accessed at https://github.com/TheDukeVin/BvM/tree/main.
Open Datasets	Yes	We compute the BvM total variation distance on a real-world instance of Bernoulli Thompson sampling provided in [28].
Dataset Splits	No	The paper describes adaptive data collection over horizons and batches for simulations (e.g., T = 10^4, 10^4 samples per batch, 12m impressions for a real-world dataset) but does not provide details about splitting a pre-existing dataset into explicit training, validation, or test sets in the conventional machine learning sense. The concept of 'batches' in the batched bandit setting refers to how data is collected, not how an initial dataset is partitioned for model evaluation.
Hardware Specification	No	We mention in Appendix A that the data in each figure require under 3 hours in CPU time to generate.
Software Dependencies	No	Chat GPT-4o was used to create code templates for a Python implementation of the lin-UCB and Stepwise Noisy Certainty Equivalent Control algorithms. The authors revised the templates to ensure correct implementation and modified them to verify the BvM statement.
Experiment Setup	Yes	Figure 1: (Left) Average TV distance measured in the BvM statement for UCB in two-arm Gaussian bandits over horizon T = 10^4 using 10^4 replicates under five different true parameter configurations labelled by [µ1, µ2] where µ1, µ2 are the true means. (Right) Average TV distance measured in the BvM statement for lin-UCB on three-arm Gaussian linear contextual bandits with context distribution N(0, I_{2×2}) under three different true parameter configurations. Standard Gaussian priors are used for all arms. Figure 2: Average BvM TV distance for UCB on Bernoulli bandits and Poisson bandits, under the same configurations as Figure 1. Beta(1, 1) priors are used for the Bernoulli bandit and Gamma(1, 1) priors for the Poisson bandit. Figure 3: (Left) Average BvM TV distance and empirical coverage of the 95% credible interval for the margin for Thompson Sampling in the two-batch two-arm Gaussian bandit setting with 10^4 samples per batch. Error bars are 95% confidence intervals over 2×10^5 replicates. Black dotted line is the correct coverage level. N(0, 1) priors are used.
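
The Figure 1 (Left) setup quoted above can be illustrated with a small simulation. This is a minimal sketch and not the authors' code (their repository is linked in the Open Source Code row). It assumes unit-variance Gaussian rewards, a textbook UCB index with bonus sqrt(2 log t / n), and that the "BvM TV distance" for an arm compares the exact posterior under a N(0, 1) prior with a Gaussian centered at the arm's sample mean with variance 1/n. The paper's exact statement and UCB variant may differ, and all function names here are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def tv_between_normals(m1, v1, m2, v2, grid=4001):
    """Total variation distance 0.5 * integral |p - q|, via the trapezoid rule."""
    sd = math.sqrt(max(v1, v2))
    lo = min(m1, m2) - 10.0 * sd
    hi = max(m1, m2) + 10.0 * sd
    step = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = lo + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * abs(normal_pdf(x, m1, v1) - normal_pdf(x, m2, v2))
    return 0.5 * total * step


def run_ucb(true_means, horizon, rng):
    """Two-arm Gaussian bandit, unit noise variance (assumed), textbook UCB index."""
    counts = [0, 0]
    sums = [0.0, 0.0]
    for t in range(1, horizon + 1):
        if t <= 2:
            arm = t - 1  # pull each arm once to initialize
        else:
            index = [
                sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
                for a in range(2)
            ]
            arm = 0 if index[0] >= index[1] else 1
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


def bvm_tv_per_arm(counts, sums):
    """TV between the exact posterior and the assumed Gaussian approximation, per arm."""
    tvs = []
    for n, s in zip(counts, sums):
        # Exact posterior with a N(0, 1) prior and unit noise: N(s / (n + 1), 1 / (n + 1)).
        post_mean, post_var = s / (n + 1.0), 1.0 / (n + 1.0)
        # Assumed BvM target: Gaussian at the sample mean with variance 1 / n.
        approx_mean, approx_var = s / n, 1.0 / n
        tvs.append(tv_between_normals(post_mean, post_var, approx_mean, approx_var))
    return tvs


if __name__ == "__main__":
    rng = random.Random(0)
    counts, sums = run_ucb([1.0, 0.0], horizon=2000, rng=rng)
    print("pull counts:", counts)
    print("per-arm TV distance:", bvm_tv_per_arm(counts, sums))
```

Averaging the per-arm distances over many independent replicates would give a quantity in the spirit of the "average TV distance" curves, though the horizon here is far shorter than the reported T = 10^4 to keep the sketch quick to run.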