Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Distributed mediation analysis with communication efficiency

Authors: Shaomin Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Theoretical analysis and numerical experiments show that, compared to the global test obtained by pooling all data together, the proposed tests achieve nearly identical power, independent of the number of machines. Furthermore, based on these two distributed test statistics, many enhanced mediation tests derived from the Sobel's or Max P tests can be easily adapted to the distributed system. We apply our method to an educational study, testing whether the effect of high school mathematics on college-level Probability and Mathematical Statistics courses is mediated by Calculus.
Researcher Affiliation	Academia	Shaomin Li, School of Mathematics and Statistics, Beijing Jiaotong University, Beijing, 100044, EMAIL
Pseudocode	Yes	Algorithm 1: The Distributed Test of Mediation Effects. Step 1. For $k = 1, \ldots, K$, compute the local statistics $T_\beta^{(k)}$ and $T_\gamma^{(k)}$, then transmit them to the central machine. Step 2. In the central machine, compute the distributed statistics $T_\beta$ and $T_\gamma$ using (4), then compute the distributed Sobel statistic $T^{\mathrm{Dis}}_{\mathrm{Sobel}}$ and Max P statistic $T^{\mathrm{Dis}}_{\mathrm{MaxP}}$ using (5) and (6), respectively. Step 3. Given the significance level $\alpha$, if $|T^{\mathrm{Dis}}_{\mathrm{Sobel}}| > Z_{1-\alpha/2}$ or $T^{\mathrm{Dis}}_{\mathrm{MaxP}} < \alpha$, reject $H_0$.
Open Source Code	Yes	The code provided in the supplemental materials can be used to reproduce results in both the simulation study and the real data analysis.
Open Datasets	No	The data are sourced from three classes, with only summary information of local data available from each class due to student privacy concerns. Our distributed test successfully detects the mediation effect, which would be undetectable using local tests from just the first or second class.
Dataset Splits	Yes	In this setting, we set the sample size on each machine to be the same, that is, $n = N/K$. We fix the total sample size $N = 2^{11}$, and set the number of machines $K = 1, 2, 4, 8, 16, 32, 64$. ... In this setting, we first generate the local sample sizes $n_k$ randomly from 50 to 150 for $k = 1, 2, \ldots, 32$. Then, we generate a total of $\sum_{k=1}^{K} n_k$ data points. ... We distributed surveys to three classes at a university. ... Class 1: $n_1 = 34$, ... Class 2: $n_2 = 14$, ... Class 3: $n_3 = 17$.
Hardware Specification	Yes	We conducted the tests in R Studio on a Mac Book Pro with an M2 CPU, and each experiment was repeated 2,000 times to calculate the empirical sizes and powers of the tests.
Software Dependencies	No	The paper mentions 'R Studio' but does not specify a version number or any other software dependencies with version numbers used for the experiments.
Experiment Setup	Yes	In this section, we conduct extensive simulation studies to evaluate the performance of the proposed distributed Sobel test and Max P test. We generated the p-dimensional exposure variable $A \sim \mathrm{Bernoulli}(0.5)$, the covariate $X \sim N(0, \Sigma_X)$ with $p = 3$ and $\Sigma_X = (0.5^{|j-l|})_{p \times p}$. The mediator $M$ and the outcome $Y$ were simulated as follows: $Y = A + \beta M + \beta_X^\top X + \epsilon_Y$, $\epsilon_Y \sim N(0, 2)$, (7) $M = \gamma A + \gamma_X^\top X + \epsilon_M$, $\epsilon_M \sim N(0, 1)$, (8) where $\beta_X = (1, 0.5, 1)^\top$, $\gamma_X = (0.5, 1, 1)^\top$. Under the null hypothesis, we consider three scenarios: (1) $(\beta, \gamma) = (0.2, 0)$; (2) $(\beta, \gamma) = (0, 0.2)$; (3) $(\beta, \gamma) = (0, 0)$. Under the alternatives, we set $(\beta, \gamma) = (0.2, 0.05)$, $(0.1, 0.1)$, and $(0.05, 0.2)$. ... The significance level $\alpha = 0.05$. We conducted the tests in R Studio on a Mac Book Pro with an M2 CPU, and each experiment was repeated 2,000 times to calculate the empirical sizes and powers of the tests.
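
The rejection rule quoted in Step 3 of Algorithm 1 (Pseudocode row) can be illustrated with a minimal Python sketch. It assumes the distributed Sobel statistic and the distributed Max P value have already been computed elsewhere; equations (4) to (6) are not reproduced on this page, so they are not implemented here. The function name `mediation_decisions` and its return format are hypothetical, and the two tests are reported separately rather than combined. This is not the authors' code, which the paper reports was run in R.

```python
from statistics import NormalDist


def mediation_decisions(t_sobel: float, p_maxp: float, alpha: float = 0.05) -> dict:
    """Apply the Step 3 rejection rules to already-computed statistics.

    t_sobel: distributed Sobel statistic (assumed computed upstream).
    p_maxp:  distributed Max P value (assumed computed upstream).
    alpha:   significance level; 0.05 is the value quoted in the setup row.
    """
    # Two-sided critical value z_{1 - alpha/2} of the standard normal.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        # Sobel-type test: reject H0 when |T| exceeds the critical value.
        "sobel_rejects": abs(t_sobel) > z_crit,
        # Max P test: reject H0 when the Max P value falls below alpha.
        "maxp_rejects": p_maxp < alpha,
    }


if __name__ == "__main__":
    # Illustrative inputs only; these numbers are not taken from the paper.
    print(mediation_decisions(t_sobel=2.31, p_maxp=0.08))
```

Keeping the two decisions separate makes it possible to compare the empirical size and power of each test individually, as the simulation description does.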