Leveraging Well-Conditioned Bases: Streaming and Distributed Summaries in Minkowski $p$-Norms
Authors: Charlie Dickens, Graham Cormode, David Woodruff
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 7 concludes with an empirical evaluation. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Warwick, Coventry, UK; (2) School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. |
| Pseudocode | Yes | Algorithm 1 Deterministic High Leverage Scores; Algorithm 2 Finding high leverage rows |
| Open Source Code | Yes | Footnote 1: Code available at https://github.com/c-dickens/stream-summaries-high-lev-rows |
| Open Datasets | Yes | Datasets. We tested the methods on a subset of the US Census Data containing 5 million rows and 11 columns [2] and Year Prediction MSD [3] which has roughly 500,000 rows and 90 columns... [2] http://www.census.gov/census2000/PUMS5.html [3] https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd |
| Dataset Splits | No | The paper mentions evaluating on datasets and using "5 independent trials with random permutations of the data" and "single pass streaming model with a fixed space constraint", but it does not specify explicit train/validation/test dataset splits (e.g., percentages, sample counts, or predefined splits) for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud computing instances). |
| Software Dependencies | No | The paper describes algorithms and methods (e.g., SPC3, QR decomposition) and mentions solving a "linear program," but it does not list specific software components with their version numbers (e.g., Python 3.x, PyTorch 1.x, CPLEX 12.x). |
| Experiment Setup | Yes | For the census dataset, space constraints between 50,000 and 500,000 rows were tested and for the Year Predictions MSD data space budgets were tested between 2,500 and 25,000. The implementation is carried out in the single pass streaming model with a fixed space constraint, m, and threshold, αp/m for both conditioning methods to ensure the number of rows kept in the summary did not exceed m. |
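
The experiment setup above (a single pass, a fixed space budget m, and a threshold that caps the summary at m rows) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes p = 2, uses a QR factorization in place of the paper's well-conditioned basis constructions, and the names `leverage_scores` and `summarize_stream` and the threshold `tau` are hypothetical. For p = 2 the leverage scores of any matrix with d columns sum to at most d, so a threshold of d/m lets at most m rows through, which mirrors the role the threshold plays in the setup.

```python
import numpy as np


def leverage_scores(a):
    """Squared row norms of an orthonormal basis for the columns of `a` (p = 2 only)."""
    q, _ = np.linalg.qr(a, mode="reduced")
    return np.sum(q ** 2, axis=1)


def summarize_stream(rows, m, tau):
    """One pass over `rows`, keeping at most m rows whose local score is >= tau.

    Assumption: rows are buffered in blocks of m, so working memory holds
    up to 2m rows (current summary plus one block) at a time.
    """
    summary = None
    block = []

    def merge(summary, block):
        stacked = np.asarray(block, dtype=float)
        if summary is not None and len(summary) > 0:
            stacked = np.vstack([summary, stacked])
        scores = leverage_scores(stacked)
        keep = np.flatnonzero(scores >= tau)
        if len(keep) > m:
            # Safety cap for arbitrary tau: keep the m largest scores.
            keep = keep[np.argsort(scores[keep])[-m:]]
        return stacked[np.sort(keep)]

    for row in rows:
        block.append(row)
        if len(block) == m:
            summary = merge(summary, block)
            block = []
    if block:
        summary = merge(summary, block)
    return summary
```

Calling `summarize_stream(data, m, tau=d / m)` on an array with d columns returns a summary of at most m rows. Swapping `leverage_scores` for a routine based on a different well-conditioned basis would be the natural place to adapt the sketch to other values of p.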