Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Statistical Perspective on Algorithmic Leveraging
Authors: Ping Ma, Michael W. Mahoney, Bin Yu
JMLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance. |
| Researcher Affiliation | Academia | Ping Ma, Department of Statistics, University of Georgia, Athens, GA 30602; Michael W. Mahoney, International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720; Bin Yu, Department of Statistics, University of California at Berkeley, Berkeley, CA 94720 |
| Pseudocode | Yes | A prototypical example of this approach is given by the following meta-algorithm (Drineas et al., 2006; Mahoney, 2011; Drineas et al., 2012), which we call SubsampleLS, and which takes as input an n × p matrix X, where n ≫ p, a vector y, and a probability distribution {π_i}_{i=1}^n, and which returns as output an approximate solution β̃_ols, which is an estimate of β̂_ols of Eqn. (3). 1. Randomly sample r > p constraints, i.e., rows of X and the corresponding elements of y, using {π_i}_{i=1}^n as an importance sampling distribution. 2. Rescale each sampled row/element by 1/(r π_i) to form a weighted LS subproblem. 3. Solve the weighted LS subproblem, formally given in Eqn. (6) below, and then return the solution β̃_ols. |
| Open Source Code | No | No explicit statement about the release of their own source code for the methodology described in this paper is provided. The paper mentions implementations of third-party tools (Blendenpik, LSRN) and that the authors 'have implemented' variants of an algorithm, but does not state their code is publicly available. |
| Open Datasets | Yes | RNA-Seq data set containing n = 51,751 read counts from embryonic mouse stem cells (Cloonan et al., 2008). ... microarray data set that was presented in Nielsen et al. (2002) (and also considered in Mahoney and Drineas 2009) |
| Dataset Splits | No | The paper describes generating synthetic data and using subsampling for real data, specifying the number of subsamples taken and repetition counts (e.g., 'we repeat our sampling 100 times to get 100 estimates'). However, it does not provide explicit training, testing, or validation dataset splits in the conventional sense for model development and evaluation. |
| Hardware Specification | Yes | In particular, the following results were obtained on a PC with Intel Core i7 Processor and 6 Gbytes RAM running Windows 7, on which we used the software package R, version 2.15.2. |
| Software Dependencies | Yes | In particular, the following results were obtained on a PC with Intel Core i7 Processor and 6 Gbytes RAM running Windows 7, on which we used the software package R, version 2.15.2. |
| Experiment Setup | Yes | We consider synthetic data of 1000 runs generated from y = Xβ + ϵ, where ϵ ∼ N(0, 9I_n), where several different values of n and p, leading to both very rectangular and moderately rectangular matrices X, are considered. The design matrix X is generated from one of three different classes of distributions introduced below. These three distributions were chosen since the first has nearly uniform leverage scores, the second has mildly non-uniform leverage scores, and the third has very non-uniform leverage scores. ... where we set β = (1_{10}, 0.1·1_{p−20}, 1_{10})^T. |
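The SubsampleLS meta-algorithm quoted in the Pseudocode row (sample r rows with probabilities {π_i}, rescale by 1/(r π_i), solve the weighted subproblem) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function and variable names are ours, leverage-based probabilities π_i = h_ii/p are computed exactly via a thin QR (the paper also discusses fast approximations), and the synthetic design here assumes a simple Gaussian X rather than reproducing the paper's three distribution classes.

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores h_ii of the hat matrix H = X (X^T X)^{-1} X^T."""
    Q, _ = np.linalg.qr(X)       # thin QR: H = Q Q^T
    return np.sum(Q**2, axis=1)  # h_ii = squared row norms of Q

def subsample_ls(X, y, r, pi, rng):
    """SubsampleLS sketch: sample r rows with probabilities pi (with
    replacement), rescale each by 1/sqrt(r*pi_i) so the weighted LS
    subproblem uses weights 1/(r*pi_i), then solve it."""
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=pi)
    w = 1.0 / np.sqrt(r * pi[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta

rng = np.random.default_rng(0)
n, p, r = 1000, 10, 100
X = rng.standard_normal((n, p))                    # assumed Gaussian design
beta_true = np.ones(p)
y = X @ beta_true + rng.normal(scale=3.0, size=n)  # eps ~ N(0, 9 I_n)

pi_lev = leverage_scores(X) / p   # leveraging: pi_i = h_ii / p (sums to 1)
pi_unif = np.full(n, 1.0 / n)     # uniform-sampling baseline

beta_lev = subsample_ls(X, y, r, pi_lev, rng)
beta_unif = subsample_ls(X, y, r, pi_unif, rng)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data OLS for reference
```

Comparing `beta_lev` and `beta_unif` against `beta_full` over repeated draws (the paper repeats such sampling 100 times) is the kind of empirical comparison the Experiment Setup row describes.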