reproducibilityindex.ai

Sliding Window Algorithms for k-Clustering Problems

Authors: Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, Morteza Zadimoghaddam

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with costs only slightly higher than those returned by algorithms with access to the full dataset. We now describe the methodology of our empirical evaluation before providing our experiments results.
Researcher Affiliation	Industry	Michele Borassi Google Zürich borassi@google.com Alessandro Epasto Google Research New York aepasto@google.com Silvio Lattanzi Google Research Zürich silviol@google.com Sergei Vassilviskii Google Research New York sergeiv@google.com Morteza Zadimoghaddam Google Research Cambridge zadim@google.com
Pseudocode	Yes	Algorithm 1 Meyerson Sketches, Compute Sketches(X, w, λ, m, M, ) Algorithm 2 Our main algorithm. Input: X, m, M, , approx. factor of ALG (β) and δ.
Open Source Code	Yes	Our code is available open-source on github6. 6https://github.com/google-research/google-research/tree/master/sliding_window_ clustering/
Open Datasets	Yes	We used 3 real-world datasets from the UCI Repository [28] that have been used in previous experiments on k-clustering for data streams settings: SKINTYPE [12], n = 245057, d = 4, SHUTTLE, n = 58000, d = 9, and COVERTYPE [13], n = 581012, d = 54. We also use 4 publicly-available synthetic dataset from [31] (the S-Set series) that have ground-truth clusters.
Dataset Splits	No	The paper mentions streaming points in natural or random order for the datasets but does not specify any explicit training, validation, or test dataset splits or methods like cross-validation.
Hardware Specification	No	The paper does not provide any specific details regarding the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies	No	The paper mentions using 'k-means++ [4] as the solver ALG' but does not specify any version numbers for this or any other software dependencies, which would be necessary for reproducibility.
Experiment Setup	Yes	Parameters. We vary the number of centers, k, from 4 to 40 and window size, w, from 10,000 to 40,000. We experiment with δ = [0.1, 0.2] and set ϵ = 0.05 (empirically the results are robust to wide settings of ϵ).