Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Streaming and Distributed Algorithms for Robust Column Subset Selection
Authors: Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, David Woodruff
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7. Experiments Our experimental results conο¬rm that our algorithms give stable low β1 error in both distributed and streaming settings. We further demonstrate the robustness of the β1 norm loss for k-CSS. ... The results for the streaming setting and the distributed setting are presented in Figure 1 and Figure 2 respectively. |
| Researcher Affiliation | Academia | 1School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: David P. Woodruff <EMAIL>, Shuli Jiang <EMAIL>, Arvind V. Mahankali <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 A one-pass streaming algorithm for bi-criteria k-CSSp in the column-update streaming model. ... Algorithm 2 Recursive Merge ... Algorithm 3 A one-round protocol for bi-criteria k-CSSp in the column partition model ... Algorithm 4 Greedy k-CSSp,2. |
| Open Source Code | Yes | The source code is available at: https://github.com/11hifish/robust_css. |
| Open Datasets | Yes | Datasets. ... 4http://gabrilovich.com/resources/data/ techtc/techtc300/techtc300.html ... 5https://archive.ics.uci.edu/ml/datasets/ gene+expression+cancer+RNA-Seq ... COIL207 dataset. COIL20 contains a total of 1440 images. ... 7http://www.cs.columbia.edu/CAVE/ software/softlib/coil-20.php |
| Dataset Splits | No | We randomly split the dataset into 80% training set (1152 samples) and 20% testing set (288 samples). No explicit mention of a 'validation' split. |
| Hardware Specification | Yes | All the experiments are conducted on AWS EC2 c5a.8xlarge machines with 32 v CPUs and 64GB EBS memory. |
| Software Dependencies | No | Our algorithms are implemented with Python Ray6, a high-level framework for parallel and distributed computing, and are thus highly scalable. No specific version numbers for Python or Ray are provided. |
| Experiment Setup | Yes | For an input data matrix A Rd n, we set the number of rows of our 1-stable (Cauchy) sketching matrix to be 0.5d in both settings. In the streaming setting, we set the batch size to be 5k and maintain a list of coresets of size 2k. In the distributed setting, we set the number of servers to be 5 and each server sends the coordinator a coreset of size 2k. |