Leveraging Well-Conditioned Bases: Streaming and Distributed Summaries in Minkowski $p$-Norms

Authors: Charlie Dickens, Graham Cormode, David Woodruff

ICML 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Section 7 concludes with an empirical evaluation. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Warwick, Coventry, UK; (2) School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. |
| Pseudocode | Yes | Algorithm 1: Deterministic High Leverage Scores; Algorithm 2: Finding high leverage rows |
| Open Source Code | Yes | Code available at https://github.com/c-dickens/stream-summaries-high-lev-rows |
| Open Datasets | Yes | Datasets. We tested the methods on a subset of the US Census Data containing 5 million rows and 11 columns [2] and Year Prediction MSD [3] which has roughly 500,000 rows and 90 columns... [2] http://www.census.gov/census2000/PUMS5.html [3] https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd |
| Dataset Splits | No | The paper mentions evaluating on datasets and using "5 independent trials with random permutations of the data" and a "single pass streaming model with a fixed space constraint", but it does not specify explicit train/validation/test dataset splits (e.g., percentages, sample counts, or predefined splits) for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud computing instances). |
| Software Dependencies | No | The paper describes algorithms and methods (e.g., SPC3, QR decomposition) and mentions solving a "linear program," but it does not list specific software components with their version numbers (e.g., Python 3.x, PyTorch 1.x, CPLEX 12.x). |
| Experiment Setup | Yes | For the census dataset, space constraints between 50,000 and 500,000 rows were tested and for the Year Predictions MSD data space budgets were tested between 2,500 and 25,000. The implementation is carried out in the single pass streaming model with a fixed space constraint, m, and threshold, αp/m for both conditioning methods to ensure the number of rows kept in the summary did not exceed m. |
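
The single-pass, fixed-space setup described under Experiment Setup can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes p = 2, uses a thin QR decomposition as the conditioning step, and the names `local_leverage_scores`, `stream_high_leverage_rows`, `block_size` and `threshold` are hypothetical. Because the squared row norms of an orthonormal basis sum to at most d (the number of columns), fewer than d / threshold rows can survive each step, which is how a threshold inversely proportional to m bounds the summary size.

```python
import numpy as np


def local_leverage_scores(block):
    """Squared row norms of Q from a thin QR of the block (p = 2 case only)."""
    q, _ = np.linalg.qr(block, mode="reduced")
    return np.sum(q ** 2, axis=1)


def stream_high_leverage_rows(rows, block_size, threshold):
    """Single pass over `rows`, keeping rows whose local score exceeds `threshold`.

    Hypothetical helper: each new block is stacked on the current summary,
    scores are recomputed on the stacked matrix, and low-scoring rows are dropped.
    """
    summary = np.empty((0, rows.shape[1]))
    for start in range(0, rows.shape[0], block_size):
        block = np.vstack([summary, rows[start:start + block_size]])
        scores = local_leverage_scores(block)
        summary = block[scores > threshold]
    return summary


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((1000, 5))
    data[500] = [1000.0, 0.0, 0.0, 0.0, 0.0]  # planted high-leverage row
    kept = stream_high_leverage_rows(data, block_size=100, threshold=0.5)
    print(kept.shape)
```

A row's leverage score within a block is never smaller than its score in the full matrix, so rows with high global leverage are retained at every step; the sketch may also keep some rows that are only locally prominent.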