Clustering High Dimensional Dynamic Data Streams

Authors: Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, Lin F. Yang

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate our construction using an offline construction on Gaussian mixture data in R². As shown in Figure 2 in Section D, we randomly generated 65536 points from R², then rounded the points to a grid of size Δ = 512. Our coreset uses log₂ Δ + 2 = 11 levels of grids. The storage in each level is very sparse. As shown in Figure 1(a), only 90 points are stored in total. We compared the 1-median costs estimated using the coreset and the dataset; the resulting difference is very small, as illustrated in Figure 1(b).
Researcher Affiliation | Collaboration | (1) Johns Hopkins University, USA; (2) Linguee GmbH; (3) TU Dortmund. Correspondence to: Lin F. Yang <lyang@jhu.edu>, Christian Sohler <christian.sohler@tu-dortmund.de>.
Pseudocode | Yes | Algorithm 1 CoreSet(S, k, ρ, ϵ): construct an ϵ-coreset for dynamic stream S. ... Algorithm 2 GetFreq(e, HH, KS, πi): retrieve the correct frequency of cell center e, given the instance of HEAVY-HITTER and K-set. ... Algorithm 3 RectifyWeights($\widehat{\lvert C_1 \rvert}, \widehat{\lvert C_2 \rvert}, \ldots, \widehat{\lvert C_k \rvert}, S$): input the estimates of the number of points in each cell and the weighted sampled points; output a weighted coreset with non-negative weights.
Open Source Code | No | We leave the full implementation as a future project.
Open Datasets | No | We illustrate our construction using an offline construction on Gaussian mixture data in R². As shown in Figure 2 in Section D, we randomly generated 65536 points from R², then rounded the points to a grid of size Δ = 512. The paper generates its own dataset and does not provide access information or citations for a publicly available dataset.
Dataset Splits | No | The paper describes generating its own data and uses it to illustrate an offline construction in Section 5. It does not mention or define any training, validation, or test splits for the data.
Hardware Specification | No | Section 5 describes the experimental setup but does not provide any details about the hardware used (e.g., CPU or GPU models, memory, or cloud instances).
Software Dependencies | No | The paper does not mention any software dependencies or version numbers used for its implementation or experiments.
Experiment Setup | Yes | We illustrate our construction using an offline construction on Gaussian mixture data in R². As shown in Figure 2 in Section D, we randomly generated 65536 points from R², then rounded the points to a grid of size Δ = 512. Our coreset uses log₂ Δ + 2 = 11 levels of grids. The storage in each level is very sparse.
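
Since no implementation was released, the data setup quoted above can only be approximated. The following is a minimal sketch, not the authors' code: it draws 65536 points from a Gaussian mixture in the plane, rounds them to a 512 × 512 grid, and counts the non-empty cells in each of the 11 nested grid levels. The function names, the number of mixture components, their centers and standard deviation, the seed, and the level indexing (cell side halving from 1024 down to 1) are all assumptions made for illustration. The paper's coreset sampling and any random grid shift are not reproduced, so this does not recover the reported 90 stored points.

```python
import random


def make_gridded_mixture(n_points=65536, delta=512, n_components=4, seed=0):
    """Draw points from a 2-D Gaussian mixture and round them to {1..delta}^2.

    n_components, the centers, and sigma are hypothetical; the source only
    says "Gaussian mixture data".
    """
    rng = random.Random(seed)
    centers = [
        (rng.uniform(0.25 * delta, 0.75 * delta),
         rng.uniform(0.25 * delta, 0.75 * delta))
        for _ in range(n_components)
    ]
    sigma = delta / 32  # assumed spread of each component
    points = []
    for _ in range(n_points):
        cx, cy = rng.choice(centers)
        x = round(rng.gauss(cx, sigma))
        y = round(rng.gauss(cy, sigma))
        # Clamp so every point lies on the grid.
        points.append((min(max(x, 1), delta), min(max(y, 1), delta)))
    return points


def num_levels(delta):
    """Number of grid levels, log2(delta) + 2, for delta a power of two."""
    if delta < 1 or delta & (delta - 1):
        raise ValueError("delta must be a power of two")
    return (delta.bit_length() - 1) + 2


def nonempty_cells_per_level(points, delta):
    """Count occupied cells per level; cell side halves from 2*delta to 1.

    This indexing is an assumption and uses unshifted, axis-aligned cells.
    """
    counts = []
    for j in range(num_levels(delta)):
        side = (2 * delta) >> j
        cells = {((x - 1) // side, (y - 1) // side) for x, y in points}
        counts.append(len(cells))
    return counts


if __name__ == "__main__":
    pts = make_gridded_mixture()
    print("levels:", num_levels(512))
    print("non-empty cells per level:", nonempty_cells_per_level(pts, 512))
```

Because the levels are nested, the non-empty cell count can only grow as the cells get finer, which gives a rough sense of how sparse the coarse levels are compared with the full point set.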