Greedy Column Subset Selection: New Bounds and Distributed Algorithms

Authors: Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, Morteza Zadimoghaddam

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the greedy algorithm for the column subset selection problem from a theoretical and empirical perspective and show its effectiveness in a distributed setting. Finally, we validate the effectiveness of this distributed algorithm via an empirical study. In this section we present an empirical investigation of the GREEDY, GREEDY++ and DISTGREEDY algorithms.
Researcher Affiliation | Collaboration | Jason Altschuler (JASONMA@PRINCETON.EDU), Princeton University, Princeton, NJ 08544; Aditya Bhaskara (BHASKARA@CS.UTAH.EDU), School of Computing, 50 S. Central Campus Drive, Salt Lake City, UT 84112; Gang (Thomas) Fu (THOMASFU@GOOGLE.COM), Vahab Mirrokni (MIRROKNI@GOOGLE.COM), Afshin Rostamizadeh (ROSTAMI@GOOGLE.COM), Morteza Zadimoghaddam (ZADIM@GOOGLE.COM), Google, 76 9th Avenue, New York, NY 10011
Pseudocode | Yes | Algorithm 1 GREEDY(A ∈ R^(m×n_A), B ∈ R^(m×n_B), k ≤ n_B); Algorithm 2 DISTGREEDY(A, B, k, ℓ)
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology; no links or explicit statements about a code release are present.
Open Datasets | Yes | We investigate using these algorithms using two datasets, one with a small set of columns (mnist)... as well as a sparse dataset with a large number of columns (news20.binary)... Both datasets can be found at: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.
Dataset Splits | No | The paper specifies training and test sets but does not explicitly mention a validation dataset split. For MNIST, it states "60,000 instances to train with and 10,000 instances for our test set." For news20.binary, it states "14,996 examples to train with and hold-out 5,000 examples to test with."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using the "LIBLINEAR library (Fan et al., 2008)" for training a linear SVM model, but it does not specify a version number for this library or any other software dependencies.
Experiment Setup | Yes | For both datasets we report test error for the best choice of regularization parameter c ∈ {10^-3, . . . , 10^4}. We run GREEDY++ and DISTGREEDY with (n/k)·log(10) marginal gain evaluations per iteration, and the distributed algorithm uses s = √(n/k) machines, with each machine receiving √(nk) columns.
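The GREEDY pseudocode referenced in the table selects columns one at a time, each time adding the column with the largest marginal gain in explained Frobenius norm. A minimal single-machine sketch of that selection rule (not the authors' implementation; the rank-one residual update used here is a standard way to evaluate marginal gains) could look like:

```python
import numpy as np

def greedy_css(A, k):
    """Greedily pick k columns of A to minimize ||A - P_S A||_F^2,
    where P_S projects onto the span of the selected column set S."""
    R = A.astype(float).copy()   # residual of A after projecting out chosen columns
    chosen = []
    for _ in range(k):
        G = R.T @ R
        diag = np.diag(G).copy()
        diag[diag < 1e-12] = np.inf          # avoid re-picking already-spent columns
        gains = (G ** 2).sum(axis=0) / diag  # marginal gain ||R^T r_j||^2 / ||r_j||^2
        j = int(np.argmax(gains))
        chosen.append(j)
        r = R[:, j]
        R -= np.outer(r, r @ R) / (r @ r)    # deflate residual along r's direction
    return chosen
```

GREEDY++ and DISTGREEDY as described in the setup row would replace the full argmax over all n columns with an argmax over a random sample of (n/k)·log(10) candidate columns per iteration.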
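The partitioning scheme in the experiment-setup row can be made concrete with a small helper (hypothetical function name; the paper only states that s = √(n/k) machines are used and that each then holds about n/s = √(nk) columns):

```python
import math

def distgreedy_partition(n, k):
    """Return (number of machines, columns per machine) for the paper's
    stated DISTGREEDY setup: s = sqrt(n/k) machines, each receiving
    roughly n/s = sqrt(n*k) of the n columns."""
    s = max(1, round(math.sqrt(n / k)))  # number of machines
    per_machine = math.ceil(n / s)       # columns assigned to each machine
    return s, per_machine
```

For example, with n = 784 columns (MNIST pixels) and k = 16, this gives s = 7 machines of 112 columns each, matching √(nk) = √12544 = 112.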