Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Anytime-valid, Bayes-assisted, Prediction-Powered Inference

Authors: Valentin Kilian, Stefano Cortinovis, François Caron

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6 Experiments. We compare the PPI and PPI++ Asymp CS procedures introduced in Section 5 with and without Bayes assistance to the Asymp CS relying solely on labelled data (obtained from Theorem 1 and referred to as classical) on several estimation problems... 6.1 Synthetic data... 6.2 Real data
Researcher Affiliation	Academia	Valentin Kilian Department of Statistics, University of Oxford EMAIL; Stefano Cortinovis Department of Statistics, University of Oxford EMAIL; François Caron Department of Statistics, University of Oxford EMAIL
Pseudocode	No	The paper describes methods conceptually and mathematically, including propositions and theorems, but does not present a formal pseudocode or algorithm block.
Open Source Code	Yes	As stated in the supplementary material, the code used to perform our experiments is made available online under a permissive licence.
Open Datasets	Yes	We evaluate our method on several real-world datasets, which are described in Section S6.2. ... Figure 3 compares classical and PPI++ Asymp CS procedures on the FLIGHTS, FOREST, and GALAXIES datasets... Figure S8 reports results for three additional estimation tasks: linear regression (CENSUS), logistic regression (HEALTHCARE), and quantile estimation (GENES)... All datasets have permissive licenses and are properly credited in the supplementary material.
Dataset Splits	Yes	we simulate an online setting akin to Section 6.1 by randomly splitting the data into a labelled set of size n1, serving as a labelled data stream, and an unlabelled set of size N.
Hardware Specification	Yes	All experiments were run locally on an Apple Silicon M4 Pro CPU with 24GB of memory, and implementation details are provided in the supplementary material.
Software Dependencies	No	The main text of the paper does not specify software dependencies with version numbers. It mentions 'implementation details are provided in the supplementary material', but these are not in the main paper.
Experiment Setup	Yes	For synthetic data, we set N = unlabelled samples {X̃_j}_{j=1}^N iid ∼ P_X and successively sample n labelled observations (X_i, Y_i)_{i=1}^n iid ∼ P with the goal of estimating the mean θ = E[Y]. ... Noisy predictions. This experiment demonstrates that our method can adapt to varying correlation levels between predictions and true labels by using the PPI++ estimator (23). We sample Y_i iid ∼ N(0, 1), so that θ = E[Y] = 0. The prediction rule is defined as f(X_i) = Y_i + ϵ_i, where X_i is only used for indexing and ϵ_i iid ∼ N(0, σ_Y²), with the noise level σ_Y ∈ {0.1, 0.8, 3}. ... Biased predictions. This experiment illustrates the potential benefits of incorporating prior information into our method. We sample X_i iid ∼ N(0, 1) and Y_i = X_i + ϵ_i, where ϵ_i iid ∼ t_df(0, 1), so that θ = E[Y] = 0. The prediction rule is defined as f(X_i) = X_i + υ, where υ ∈ ℝ controls its bias level. ... We vary υ between −1.2 and 1.2, and df ∈ {5, 10, } to study the impact of bias level and noise distribution on the Asymp CS procedures.
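
The two synthetic settings quoted under Experiment Setup can be sketched in a few lines of standard-library Python. This is only an illustrative data generator under stated assumptions, not the authors' code: the function names, the seed, and the sample size n = 1000 are hypothetical, and the sketch covers data generation only (no confidence sequence is computed). The Student-t draw uses the standard construction Z / sqrt(chi2 / df) and requires a finite df.

```python
import math
import random


def sample_student_t(rng, df):
    """Draw one Student-t variate with finite `df` degrees of freedom (location 0, scale 1)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-squared(df) variable is Gamma(shape=df/2, scale=2).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(chi2 / df)


def noisy_predictions(rng, n, sigma_y):
    """Noisy setting: Y_i ~ N(0, 1), f(X_i) = Y_i + eps_i with eps_i ~ N(0, sigma_y^2)."""
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    preds = [y + rng.gauss(0.0, sigma_y) for y in ys]
    return ys, preds


def biased_predictions(rng, n, bias, df):
    """Biased setting: X_i ~ N(0, 1), Y_i = X_i + eps_i with eps_i ~ t_df, f(X_i) = X_i + bias."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + sample_student_t(rng, df) for x in xs]
    preds = [x + bias for x in xs]
    return xs, ys, preds


# Hypothetical usage: the seed and n are illustrative choices, not values from the paper.
rng = random.Random(0)
ys_noisy, preds_noisy = noisy_predictions(rng, n=1000, sigma_y=0.8)
xs, ys_biased, preds_biased = biased_predictions(rng, n=1000, bias=1.2, df=5)
```

In both settings the target is θ = E[Y] = 0, so the sample mean of the labels should sit near zero, while the predictions in the biased setting are offset from X_i by exactly the chosen bias.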