Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Statistical Perspective on Algorithmic Leveraging
Authors: Ping Ma, Michael W. Mahoney, Bin Yu
JMLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance. |
| Researcher Affiliation | Academia | Ping Ma, Department of Statistics, University of Georgia, Athens, GA 30602; Michael W. Mahoney, International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720; Bin Yu, Department of Statistics, University of California at Berkeley, Berkeley, CA 94720 |
| Pseudocode | Yes | A prototypical example of this approach is given by the following meta-algorithm (Drineas et al., 2006; Mahoney, 2011; Drineas et al., 2012), which we call SubsampleLS, and which takes as input an n × p matrix X, where n ≫ p, a vector y, and a probability distribution {π_i}_{i=1}^n, and which returns as output an approximate solution β̃_ols, which is an estimate of β̂_ols of Eqn. (3). 1. Randomly sample r > p constraints, i.e., rows of X and the corresponding elements of y, using {π_i}_{i=1}^n as an importance sampling distribution. 2. Rescale each sampled row/element by 1/(r π_i) to form a weighted LS subproblem. 3. Solve the weighted LS subproblem, formally given in Eqn. (6) below, and then return the solution β̃_ols. |
| Open Source Code | No | No explicit statement about the release of their own source code for the methodology described in this paper is provided. The paper mentions implementations of third-party tools (Blendenpik, LSRN) and that the authors 'have implemented' variants of an algorithm, but does not state their code is publicly available. |
| Open Datasets | Yes | RNA-Seq data set containing n = 51,751 read counts from embryonic mouse stem cells (Cloonan et al., 2008). ... microarray data set that was presented in Nielsen et al. (2002) (and also considered in Mahoney and Drineas 2009) |
| Dataset Splits | No | The paper describes generating synthetic data and using subsampling for real data, specifying the number of subsamples taken and repetition counts (e.g., 'we repeat our sampling 100 times to get 100 estimates'). However, it does not provide explicit training, testing, or validation dataset splits in the conventional sense for model development and evaluation. |
| Hardware Specification | Yes | In particular, the following results were obtained on a PC with Intel Core i7 Processor and 6 Gbytes RAM running Windows 7, on which we used the software package R, version 2.15.2. |
| Software Dependencies | Yes | In particular, the following results were obtained on a PC with Intel Core i7 Processor and 6 Gbytes RAM running Windows 7, on which we used the software package R, version 2.15.2. |
| Experiment Setup | Yes | We consider synthetic data of 1000 runs generated from y = Xβ + ϵ, where ϵ ∼ N(0, 9I_n), where several different values of n and p, leading to both very rectangular and moderately rectangular matrices X, are considered. The design matrix X is generated from one of three different classes of distributions introduced below. These three distributions were chosen since the first has nearly uniform leverage scores, the second has mildly non-uniform leverage scores, and the third has very non-uniform leverage scores. ... where we set β = (1_{10}, 0.1·1_{p−20}, 1_{10})^T. |
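The SubsampleLS meta-algorithm quoted in the Pseudocode row (sample r rows with probabilities {π_i}, rescale by 1/(r π_i), solve the weighted subproblem) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function and variable names are ours, leverage-based probabilities π_i = h_ii/p are computed exactly via a thin QR (the paper also discusses fast approximations), and the synthetic design here assumes a simple Gaussian X rather than reproducing the paper's three distribution classes.

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores h_ii of the hat matrix H = X (X^T X)^{-1} X^T."""
    Q, _ = np.linalg.qr(X)       # thin QR: H = Q Q^T
    return np.sum(Q**2, axis=1)  # h_ii = squared row norms of Q

def subsample_ls(X, y, r, pi, rng):
    """SubsampleLS sketch: sample r rows with probabilities pi (with
    replacement), rescale each by 1/sqrt(r*pi_i) so the weighted LS
    subproblem uses weights 1/(r*pi_i), then solve it."""
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=pi)
    w = 1.0 / np.sqrt(r * pi[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta

rng = np.random.default_rng(0)
n, p, r = 1000, 10, 100
X = rng.standard_normal((n, p))                    # assumed Gaussian design
beta_true = np.ones(p)
y = X @ beta_true + rng.normal(scale=3.0, size=n)  # eps ~ N(0, 9 I_n)

pi_lev = leverage_scores(X) / p   # leveraging: pi_i = h_ii / p (sums to 1)
pi_unif = np.full(n, 1.0 / n)     # uniform-sampling baseline

beta_lev = subsample_ls(X, y, r, pi_lev, rng)
beta_unif = subsample_ls(X, y, r, pi_unif, rng)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data OLS for reference
```

Comparing `beta_lev` and `beta_unif` against `beta_full` over repeated draws (the paper repeats such sampling 100 times) is the kind of empirical comparison the Experiment Setup row describes.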