Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

An Online Sequential Test for Qualitative Treatment Effects

Authors: Chengchun Shi, Shikai Luo, Hongtu Zhu, Rui Song

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive empirical studies are conducted to examine the finite sample performance of our test procedure. (Abstract); In this section, we conduct Monte Carlo simulations to examine the finite sample properties of the proposed test. (4.1.1); In this section, we apply the proposed method to a Yahoo! Today Module user click log dataset (4.2).
Researcher Affiliation	Collaboration	Chengchun Shi EMAIL Department of Statistics, London School of Economics and Political Science, Shikai Luo EMAIL Tencent PCG, Hongtu Zhu EMAIL Department of Biostatistics, University of North-Carolina, Rui Song EMAIL Department of Statistics, North-Carolina State University
Pseudocode	Yes	Algorithm 1: the pseudocode that summarizes the online bootstrap testing procedure.
Open Source Code	No	No explicit statement or link to the source code for the methodology described in this paper is provided.
Open Datasets	Yes	In this section, we apply the proposed method to a Yahoo! Today Module user click log dataset [1], which contains 45,811,883 user visits to the Today Module, during the first ten days in May 2009. ... [1] https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49
Dataset Splits	No	The paper describes a sequential online testing procedure and how users are dynamically assigned to different arms in A/B experiments, rather than providing fixed training/test/validation dataset splits for model training and evaluation.
Hardware Specification	Yes	We run our experiments on a single computer instance with 40 Intel(R) Xeon(R) 2.20GHz CPUs. It takes 1-2 seconds on average to compute each test.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers for reproducibility.
Experiment Setup	Yes	We generated the potential outcomes as $Y_i(a) = 1 + (X_{i1} - X_{i2})/2 + a\tau(X_i) + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. $N(0, 0.5^2)$. The covariates $X_i = (X_{i1}, X_{i2}, X_{i3})$ were generated as follows... We consider two randomization designs... In addition, we set $N(T_1) = 2000$ and $N(T_j) - N(T_{j-1}) = 2n$ for $2 \le j \le K$ and some $n > 0$. We consider two combinations of $(n, K)$, corresponding to $(n, K) = (200, 5)$ and $(20, 50)$. We set the significance level $\alpha = 0.05$ and choose $B = 10000$. We set $\tau(X_i) = \varphi_\delta\{(X_{i1} + X_{i2})/\sqrt{2}\} X_{i3}^2$ for some function $\varphi_\delta$ parameterized by some $\delta \ge 0$... For all settings, we construct the basis function $\phi(\cdot)$ using additive cubic splines. For each univariate spline, we set the number of internal knots to be 4. These knots are equally spaced between $[-2, 2]$.
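
The simulation design quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code (the index reports that no source code was released). The function names are hypothetical. The excerpt elides both the covariate distribution and the form of the effect function (written `phi` in the code), so both are supplied by the caller. The minus sign in the baseline term and the reading of "equally spaced" knots as lying strictly inside the interval are inferred from the garbled excerpt and may not match the paper.

```python
import math
import random


def interim_sample_sizes(n, K, first=2000):
    """Cumulative sample sizes N(T_1), ..., N(T_K): N(T_1) = first, then steps of 2n."""
    return [first + 2 * n * (j - 1) for j in range(1, K + 1)]


def internal_knots(num=4, low=-2.0, high=2.0):
    """Equally spaced internal knots, ASSUMED to lie strictly inside (low, high)."""
    step = (high - low) / (num + 1)
    return [low + step * (k + 1) for k in range(num)]


def make_tau(phi):
    """Build tau(x) = phi((x1 + x2) / sqrt(2)) * x3**2 from a caller-supplied phi.

    phi stands in for the paper's effect function, whose form is elided in the
    excerpt, so any concrete phi passed here is hypothetical.
    """
    def tau(x):
        x1, x2, x3 = x
        return phi((x1 + x2) / math.sqrt(2)) * x3 ** 2
    return tau


def potential_outcome(x, a, tau, rng, noise_sd=0.5):
    """Draw Y(a) = 1 + (x1 - x2)/2 + a * tau(x) + eps, with eps ~ N(0, noise_sd^2).

    The minus sign between x1 and x2 is an inference from the garbled excerpt.
    """
    x1, x2, _ = x
    return 1 + (x1 - x2) / 2 + a * tau(x) + rng.gauss(0.0, noise_sd)


if __name__ == "__main__":
    rng = random.Random(0)
    # Interim-analysis schedules for the two (n, K) combinations in the excerpt.
    print(interim_sample_sizes(200, 5))
    print(interim_sample_sizes(20, 50)[-1])
    print(internal_knots())
    # Hypothetical null effect and covariate vector, for illustration only.
    null_tau = make_tau(lambda u: 0.0)
    print(potential_outcome((0.5, -0.5, 1.0), 1, null_tau, rng))
```

Passing the effect function in as an argument keeps the sketch from guessing at details the excerpt omits. The schedule helper makes explicit that the two designs end at cumulative sample sizes of 3600 and 3960, respectively.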