Sliding Window Algorithms for k-Clustering Problems
Authors: Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, Morteza Zadimoghaddam
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with costs only slightly higher than those returned by algorithms with access to the full dataset. We now describe the methodology of our empirical evaluation before providing our experiments results. |
| Researcher Affiliation | Industry | Michele Borassi Google Zürich borassi@google.com Alessandro Epasto Google Research New York aepasto@google.com Silvio Lattanzi Google Research Zürich silviol@google.com Sergei Vassilviskii Google Research New York sergeiv@google.com Morteza Zadimoghaddam Google Research Cambridge zadim@google.com |
| Pseudocode | Yes | Algorithm 1 Meyerson Sketches, Compute Sketches(X, w, λ, m, M, ) Algorithm 2 Our main algorithm. Input: X, m, M, , approx. factor of ALG (β) and δ. |
| Open Source Code | Yes | Our code is available open-source on github6. 6https://github.com/google-research/google-research/tree/master/sliding_window_ clustering/ |
| Open Datasets | Yes | We used 3 real-world datasets from the UCI Repository [28] that have been used in previous experiments on k-clustering for data streams settings: SKINTYPE [12], n = 245057, d = 4, SHUTTLE, n = 58000, d = 9, and COVERTYPE [13], n = 581012, d = 54. We also use 4 publicly-available synthetic dataset from [31] (the S-Set series) that have ground-truth clusters. |
| Dataset Splits | No | The paper mentions streaming points in natural or random order for the datasets but does not specify any explicit training, validation, or test dataset splits or methods like cross-validation. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using 'k-means++ [4] as the solver ALG' but does not specify any version numbers for this or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | Parameters. We vary the number of centers, k, from 4 to 40 and window size, w, from 10,000 to 40,000. We experiment with δ = [0.1, 0.2] and set ϵ = 0.05 (empirically the results are robust to wide settings of ϵ). |