Clustering High Dimensional Dynamic Data Streams
Authors: Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, Lin F. Yang
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our construction using an offline construction on Gaussian mixture data in R^2. As shown in Figure 2 in Section D, we randomly generated 65536 points from R^2, then rounded the points to a grid of size Δ = 512. Our coreset uses log₂ Δ + 2 = 11 levels of grids. The storage in each level is very sparse. As shown in Figure 1(a), only 90 points are stored in total. We compared the 1-median costs estimated using the coreset and the dataset; the resulting difference is very small, as illustrated in Figure 1(b). |
| Researcher Affiliation | Collaboration | ¹Johns Hopkins University, USA; ²Linguee GmbH; ³TU Dortmund. Correspondence to: Lin F. Yang <lyang@jhu.edu>, Christian Sohler <christian.sohler@tu-dortmund.de>. |
| Pseudocode | Yes | Algorithm 1 CoreSet(S, k, ρ, ϵ): construct an ϵ-coreset for dynamic stream S. ... Algorithm 2 GetFreq(e, HH, KS, πi): retrieve the correct frequency of cell center e, given the instances of HEAVY-HITTER and K-set. ... Algorithm 3 RectifyWeights($\widehat{\lvert C_1\rvert}, \widehat{\lvert C_2\rvert}, \ldots, \widehat{\lvert C_k\rvert}$, S): input the estimates of the number of points in each cell and the weighted sampled points; output a weighted coreset with non-negative weights. |
| Open Source Code | No | We leave the full implementation as a future project. |
| Open Datasets | No | We illustrate our construction using an offline construction on Gaussian mixture data in R^2. As shown in Figure 2 in Section D, we randomly generated 65536 points from R^2, then rounded the points to a grid of size Δ = 512. The paper generates its own dataset and does not provide access information or citations for a publicly available dataset. |
| Dataset Splits | No | The paper describes generating its own data and uses it for illustration of an offline construction in Section 5. It does not mention or define any specific training, validation, or test splits for the data. |
| Hardware Specification | No | Section 5 describes the experimental setup but does not provide any details about the specific hardware used (e.g., CPU, GPU models, memory, or cloud instances). |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers used for its implementation or experiments. |
| Experiment Setup | Yes | We illustrate our construction using an offline construction on Gaussian mixture data in R^2. As shown in Figure 2 in Section D, we randomly generated 65536 points from R^2, then rounded the points to a grid of size Δ = 512. Our coreset uses log₂ Δ + 2 = 11 levels of grids. The storage in each level is very sparse. |
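
The experiment setup quoted in the table can be sketched roughly as follows. This is a minimal illustration of the data generation and the grid hierarchy only, not the authors' coreset construction, since no implementation was released. The mixture centers, the standard deviation, the seed, and the helper names (`num_levels`, `level_sides`, `sample_mixture`, `cell_counts`) are all hypothetical. The choice of cell sides from 2 × 512 down to 1 is likewise an assumption, picked so that the number of levels comes out to the 11 stated in the setup.

```python
import math
import random
from collections import Counter

DELTA = 512        # grid size from the quoted setup
N_POINTS = 65536   # number of generated points from the quoted setup


def num_levels(delta: int) -> int:
    """Number of grid levels, log2(delta) + 2, as quoted in the setup."""
    return int(math.log2(delta)) + 2


def level_sides(delta: int) -> list:
    """Cell side length per level, halving from 2*delta down to 1.

    The exact level indexing is an assumption made for this sketch.
    """
    return [2 * delta // (2 ** i) for i in range(num_levels(delta))]


def sample_mixture(n, delta, centers, sigma, seed=0):
    """Draw n points from an equal-weight Gaussian mixture in the plane,
    then round and clamp them to the integer grid {1, ..., delta}^2.

    `centers` and `sigma` are hypothetical; the source does not state them.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        x = min(max(round(rng.gauss(cx, sigma)), 1), delta)
        y = min(max(round(rng.gauss(cy, sigma)), 1), delta)
        points.append((x, y))
    return points


def cell_counts(points, side):
    """Count the points falling in each axis-aligned cell of the given side."""
    return Counter(((x - 1) // side, (y - 1) // side) for x, y in points)


if __name__ == "__main__":
    centers = [(128, 128), (384, 160), (256, 400)]  # hypothetical mixture centers
    pts = sample_mixture(N_POINTS, DELTA, centers, sigma=20.0)
    for side in level_sides(DELTA):
        # Number of non-empty cells at this level of the hierarchy.
        print(side, len(cell_counts(pts, side)))
```

Running the script prints, for each level, the cell side and the number of non-empty cells. That count is only a rough indication of how the data spreads over the hierarchy and is not the per-level coreset storage reported in the paper.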