Streaming and Distributed Algorithms for Robust Column Subset Selection

Authors: Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, David Woodruff

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 7. Experiments Our experimental results confirm that our algorithms give stable low ℓ1 error in both distributed and streaming settings. We further demonstrate the robustness of the ℓ1 norm loss for k-CSS. ... The results for the streaming setting and the distributed setting are presented in Figure 1 and Figure 2 respectively.
Researcher Affiliation Academia 1School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: David P. Woodruff <dwoodruf@andrew.cmu.edu>, Shuli Jiang <shulij@andrew.cmu.edu>, Arvind V. Mahankali <amahanka@andrew.cmu.edu>.
Pseudocode Yes Algorithm 1 A one-pass streaming algorithm for bi-criteria k-CSSp in the column-update streaming model. ... Algorithm 2 Recursive Merge ... Algorithm 3 A one-round protocol for bi-criteria k-CSSp in the column partition model ... Algorithm 4 Greedy k-CSSp,2.
Open Source Code Yes The source code is available at: https://github.com/11hifish/robust_css.
Open Datasets Yes Datasets. ... 4http://gabrilovich.com/resources/data/ techtc/techtc300/techtc300.html ... 5https://archive.ics.uci.edu/ml/datasets/ gene+expression+cancer+RNA-Seq ... COIL207 dataset. COIL20 contains a total of 1440 images. ... 7http://www.cs.columbia.edu/CAVE/ software/softlib/coil-20.php
Dataset Splits No We randomly split the dataset into 80% training set (1152 samples) and 20% testing set (288 samples). No explicit mention of a 'validation' split.
Hardware Specification Yes All the experiments are conducted on AWS EC2 c5a.8xlarge machines with 32 v CPUs and 64GB EBS memory.
Software Dependencies No Our algorithms are implemented with Python Ray6, a high-level framework for parallel and distributed computing, and are thus highly scalable. No specific version numbers for Python or Ray are provided.
Experiment Setup Yes For an input data matrix A Rd n, we set the number of rows of our 1-stable (Cauchy) sketching matrix to be 0.5d in both settings. In the streaming setting, we set the batch size to be 5k and maintain a list of coresets of size 2k. In the distributed setting, we set the number of servers to be 5 and each server sends the coordinator a coreset of size 2k.