Streaming and Distributed Algorithms for Robust Column Subset Selection
Authors: Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, David Woodruff
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7. Experiments Our experimental results confirm that our algorithms give stable low ℓ1 error in both distributed and streaming settings. We further demonstrate the robustness of the ℓ1 norm loss for k-CSS. ... The results for the streaming setting and the distributed setting are presented in Figure 1 and Figure 2 respectively. |
| Researcher Affiliation | Academia | 1School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: David P. Woodruff <dwoodruf@andrew.cmu.edu>, Shuli Jiang <shulij@andrew.cmu.edu>, Arvind V. Mahankali <amahanka@andrew.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 A one-pass streaming algorithm for bi-criteria k-CSSp in the column-update streaming model. ... Algorithm 2 Recursive Merge ... Algorithm 3 A one-round protocol for bi-criteria k-CSSp in the column partition model ... Algorithm 4 Greedy k-CSSp,2. |
| Open Source Code | Yes | The source code is available at: https://github.com/11hifish/robust_css. |
| Open Datasets | Yes | Datasets. ... 4http://gabrilovich.com/resources/data/ techtc/techtc300/techtc300.html ... 5https://archive.ics.uci.edu/ml/datasets/ gene+expression+cancer+RNA-Seq ... COIL207 dataset. COIL20 contains a total of 1440 images. ... 7http://www.cs.columbia.edu/CAVE/ software/softlib/coil-20.php |
| Dataset Splits | No | We randomly split the dataset into 80% training set (1152 samples) and 20% testing set (288 samples). No explicit mention of a 'validation' split. |
| Hardware Specification | Yes | All the experiments are conducted on AWS EC2 c5a.8xlarge machines with 32 v CPUs and 64GB EBS memory. |
| Software Dependencies | No | Our algorithms are implemented with Python Ray6, a high-level framework for parallel and distributed computing, and are thus highly scalable. No specific version numbers for Python or Ray are provided. |
| Experiment Setup | Yes | For an input data matrix A Rd n, we set the number of rows of our 1-stable (Cauchy) sketching matrix to be 0.5d in both settings. In the streaming setting, we set the batch size to be 5k and maintain a list of coresets of size 2k. In the distributed setting, we set the number of servers to be 5 and each server sends the coordinator a coreset of size 2k. |