Leveraging Well-Conditioned Bases: Streaming and Distributed Summaries in Minkowski $p$-Norms

Authors: Charlie Dickens, Graham Cormode, David Woodruff

ICML 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Section 7 concludes with an empirical evaluation.
Researcher Affiliation	Academia	[1] Department of Computer Science, University of Warwick, Coventry, UK; [2] School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Pseudocode	Yes	Algorithm 1 Deterministic High Leverage Scores; Algorithm 2 Finding high leverage rows
Open Source Code	Yes	[1] Code available at https://github.com/c-dickens/stream-summaries-high-lev-rows
Open Datasets	Yes	Datasets. We tested the methods on a subset of the US Census Data containing 5 million rows and 11 columns [2] and Year Prediction MSD [3] which has roughly 500,000 rows and 90 columns... [2] http://www.census.gov/census2000/PUMS5.html [3] https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd
Dataset Splits	No	The paper mentions evaluating on datasets and using "5 independent trials with random permutations of the data" and "single pass streaming model with a fixed space constraint", but it does not specify explicit train/validation/test dataset splits (e.g., percentages, sample counts, or predefined splits) for reproducibility.
Hardware Specification	No	The paper does not provide any specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud computing instances).
Software Dependencies	No	The paper describes algorithms and methods (e.g., SPC3, QR decomposition) and mentions solving a "linear program," but it does not list specific software components with their version numbers (e.g., Python 3.x, PyTorch 1.x, CPLEX 12.x).
Experiment Setup	Yes	For the census dataset, space constraints between 50,000 and 500,000 rows were tested and for the Year Predictions MSD data space budgets were tested between 2,500 and 25,000. The implementation is carried out in the single pass streaming model with a fixed space constraint, $m$, and threshold, $\alpha^p/m$ for both conditioning methods to ensure the number of rows kept in the summary did not exceed $m$.
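
The setup described in the last row can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes p = 2 only, with conditioning by a thin QR decomposition, in which case the squared row norms of Q are the ordinary leverage scores and sum to at most d, so a threshold of d/m keeps at most m rows. The function names `l2_leverage_scores` and `stream_high_leverage_rows`, the block-wise processing, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def l2_leverage_scores(block):
    """Squared row norms of Q from a thin QR of `block` (l2 leverage scores)."""
    q, _ = np.linalg.qr(block, mode="reduced")
    return np.sum(q ** 2, axis=1)


def stream_high_leverage_rows(blocks, m, threshold):
    """Single pass over `blocks`, retaining at most `m` rows.

    Hypothetical sketch: each incoming block is stacked onto the rows kept
    so far, leverage scores are recomputed on the stacked matrix, and only
    rows scoring at least `threshold` survive.
    """
    kept = None
    for block in blocks:
        candidate = block if kept is None else np.vstack([kept, block])
        scores = l2_leverage_scores(candidate)
        mask = scores >= threshold
        selected, selected_scores = candidate[mask], scores[mask]
        if selected.shape[0] > m:
            # Safety net for thresholds below d/m: keep the m largest scores.
            top = np.argsort(selected_scores)[-m:]
            selected = selected[top]
        kept = selected
    return kept


# Synthetic usage (assumed data, not the paper's datasets).
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 5))
data[0] = 0.0
data[0, 0] = 100.0  # one planted row with very high leverage
budget = 50
summary = stream_high_leverage_rows(
    np.array_split(data, 10), m=budget, threshold=data.shape[1] / budget
)
print(summary.shape)
```

With the threshold set to d/m the explicit truncation never triggers, because at most m rows can each carry a d/m share of a total of d. It is kept only so that the space budget still holds if a smaller threshold is passed.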