Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
Authors: Yang Yu, Shih-Kang Chao, Guang Cheng
JMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material. |
| Researcher Affiliation | Academia | Yang Yu EMAIL Department of Statistics Purdue University West Lafayette, IN 47907, USA; Shih-Kang Chao EMAIL Department of Statistics University of Missouri Columbia, MO 65211, USA; Guang Cheng EMAIL Department of Statistics University of California, Los Angeles Los Angeles, CA 90095, USA |
| Pseudocode | Yes | Algorithm 1: k-grad/n+k-1-grad with de-biased ℓ1-CSL estimator; Algorithm 2: Distributed K-fold cross-validation for t-step CSL; Algorithm 3: DistBoots(method, $\tilde{\theta}$, $\{g_j\}_{j=1}^{k}$, $\tilde{\Theta}$); Algorithm 4: Node($\widehat{M}$); Algorithm 5: Simultaneous inference for distributed data with heteroscedasticity |
| Open Source Code | Yes | The code to reproduce the numerical results is available in Supplementary Material. |
| Open Datasets | Yes | The US Airline On-Time Performance dataset (DVN, 2008), available at http://stat-computing.org/dataexpo/2009 |
| Dataset Splits | Yes | We randomly sample a dataset $D_1$ of N = 500,000 observations, and conceptually distribute them across k = 1,000 nodes such that each node receives n = 500 observations. We randomly sample another dataset $D_2$ of N = 500,000 observations for a pilot study to select relevant variables, where $D_1 \cap D_2 = \emptyset$. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We consider a Gaussian linear model and a logistic regression model. We fix total sample size $N = 2^{14}$ and the dimension $d = 2^{10}$, and choose the number of machines $k$ from $\{2^2, 2^3, \dots, 2^6\}$. The true coefficient θ is a d-dimensional vector in which the first $s_0$ coordinates are 1 and the rest is 0, where $s_0 \in \{2^2, 2^4\}$ for the linear model and $s_0 \in \{2^1, 2^3\}$ for the GLM. ... For the ℓ1-CSL computation, we choose the initial $\lambda^{(0)}$ by a local K-fold cross-validation, where K = 10 for linear regression and K = 5 for logistic regression. For each iteration t, $\lambda^{(t)}$ is selected by Algorithm 2 in Section 2.4 with K folds with $K = \min\{k - 1, 5\}$ ... At each replication, we draw B = 500 bootstrap samples, from which we calculate the 95% empirical quantile to further obtain the 95% simultaneous confidence interval. |
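
The simulation layout quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the covariates are assumed to be i.i.d. standard normal with unit noise variance (the quoted text does not specify the design), and the bootstrap draws at the end are placeholder values rather than the paper's k-grad or n+k-1-grad statistics. Sizes are reduced from the quoted $N = 2^{14}$, $d = 2^{10}$ so the example runs quickly with the standard library alone.

```python
import math
import random


def make_theta(d, s0):
    """True coefficient vector: first s0 coordinates are 1, the rest are 0."""
    return [1.0] * s0 + [0.0] * (d - s0)


def simulate_linear(N, d, s0, rng):
    """Draw (X, y, theta) from a Gaussian linear model y = X theta + noise.

    Assumption: i.i.d. N(0, 1) covariates and N(0, 1) noise; the design
    used in the paper may differ.
    """
    theta = make_theta(d, s0)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
    y = [sum(xj * tj for xj, tj in zip(x, theta)) + rng.gauss(0.0, 1.0) for x in X]
    return X, y, theta


def split_across_nodes(X, y, k):
    """Partition N observations into k nodes holding n = N / k rows each."""
    N = len(y)
    if N % k != 0:
        raise ValueError("N must be divisible by k")
    n = N // k
    return [(X[j * n:(j + 1) * n], y[j * n:(j + 1) * n]) for j in range(k)]


def empirical_quantile(draws, level=0.95):
    """Smallest order statistic with at least `level` of the draws at or below it."""
    ordered = sorted(draws)
    # Small tolerance guards against floating-point error in level * B.
    idx = math.ceil(level * len(ordered) - 1e-9) - 1
    return ordered[max(idx, 0)]


# Reduced-size demo: N = 2^8, d = 2^4, s0 = 2^2, k = 2^2 (so n = 2^6 per node).
rng = random.Random(0)
X, y, theta = simulate_linear(N=2**8, d=2**4, s0=2**2, rng=rng)
nodes = split_across_nodes(X, y, k=2**2)
print(len(nodes), "nodes with", len(nodes[0][1]), "observations each")

# Placeholder bootstrap draws (max of |N(0,1)| over d coordinates), B = 500.
# These stand in for the paper's bootstrap statistics, which are not reproduced here.
B = 500
draws = [max(abs(rng.gauss(0.0, 1.0)) for _ in range(len(theta))) for _ in range(B)]
print("95% empirical quantile of placeholder draws:", empirical_quantile(draws, 0.95))
```

In the quoted setup, the 95% empirical quantile of the B = 500 bootstrap draws is the critical value used to form the simultaneous confidence interval; the sketch only shows that quantile step on stand-in numbers.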