reproducibilityindex.ai

Streaming and Distributed Algorithms for Robust Column Subset Selection

Authors: Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, David Woodruff

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	7. Experiments Our experimental results conﬁrm that our algorithms give stable low ℓ1 error in both distributed and streaming settings. We further demonstrate the robustness of the ℓ1 norm loss for k-CSS. ... The results for the streaming setting and the distributed setting are presented in Figure 1 and Figure 2 respectively.
Researcher Affiliation	Academia	1School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: David P. Woodruff <dwoodruf@andrew.cmu.edu>, Shuli Jiang <shulij@andrew.cmu.edu>, Arvind V. Mahankali <amahanka@andrew.cmu.edu>.
Pseudocode	Yes	Algorithm 1 A one-pass streaming algorithm for bi-criteria k-CSSp in the column-update streaming model. ... Algorithm 2 Recursive Merge ... Algorithm 3 A one-round protocol for bi-criteria k-CSSp in the column partition model ... Algorithm 4 Greedy k-CSSp,2.
Open Source Code	Yes	The source code is available at: https://github.com/11hifish/robust_css.
Open Datasets	Yes	Datasets. ... 4http://gabrilovich.com/resources/data/ techtc/techtc300/techtc300.html ... 5https://archive.ics.uci.edu/ml/datasets/ gene+expression+cancer+RNA-Seq ... COIL207 dataset. COIL20 contains a total of 1440 images. ... 7http://www.cs.columbia.edu/CAVE/ software/softlib/coil-20.php
Dataset Splits	No	We randomly split the dataset into 80% training set (1152 samples) and 20% testing set (288 samples). No explicit mention of a 'validation' split.
Hardware Specification	Yes	All the experiments are conducted on AWS EC2 c5a.8xlarge machines with 32 v CPUs and 64GB EBS memory.
Software Dependencies	No	Our algorithms are implemented with Python Ray6, a high-level framework for parallel and distributed computing, and are thus highly scalable. No specific version numbers for Python or Ray are provided.
Experiment Setup	Yes	For an input data matrix A Rd n, we set the number of rows of our 1-stable (Cauchy) sketching matrix to be 0.5d in both settings. In the streaming setting, we set the batch size to be 5k and maintain a list of coresets of size 2k. In the distributed setting, we set the number of servers to be 5 and each server sends the coordinator a coreset of size 2k.