Stars: Tera-Scale Graph Building for Clustering and Learning
Authors: CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10~1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2~10-fold improvement in running time without quality loss. ... We then provide an empirical evaluation of Stars in Section 5, where we demonstrate that Stars performs significantly better in practice than even the theoretical bounds suggest. |
| Researcher Affiliation | Industry | CJ Carey (Google Research, cjcarey@google.com); Jonathan Halcrow (Google Research, halcrow@google.com); Rajesh Jayaram (Google Research, rkjayaram@google.com); Vahab Mirrokni (Google Research, mirrokni@google.com); Warren Schudy (Google Research, wschudy@google.com); Peilin Zhong (Google Research, peilinz@google.com) |
| Pseudocode | Yes | Stars 1: Constructing Approximate Threshold Graphs. Input: Point set P, similarity measure µ, and an (r₁, r₂, ρ)-sensitive hash family H. ... Stars 2: Constructing Approximate Nearest Neighbor Graphs. Input: Point set P, a hash family H, ANN parameters k, ρ, M > 0, recall parameter δ ∈ (0, 1). *(An illustrative sketch of the bucketing-and-leaders pattern appears after the table.)* |
| Open Source Code | No | The paper notes that Stars is implemented as part of the Grale [25] system, but it provides no link or explicit statement about open-sourcing its own Stars implementation. |
| Open Datasets | Yes | We run experiments on three real public datasets: MNIST [30], Amazon2m [9] (also known as OGBN-Products [28]), and Wikipedia [39], and two synthetic datasets: random1B and random10B. |
| Dataset Splits | No | The paper names the datasets used for evaluation but does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or references to standard splits) for its evaluation. |
| Hardware Specification | No | Our implementation has two primary phases: generating LSH tables using LSH or Sorting LSH, then scoring pairs of points that share a sketch using all-pairs or Stars. ... Each logical unit of computation is automatically distributed across a number of worker machines, with the experiments in this paper scaling to thousands of individual workers. |
| Software Dependencies | No | We have implemented Stars as part of the Grale [25] graph building system using Flume, a C++ counterpart to FlumeJava [13]. The paper does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all algorithms, we use the number of sketches R = 400. ... Set the window size to be W = 16k ... sample s = c₂δ⁻¹n^ρ log² n uniformly random leaders ... An additional important implementation detail is the limit we impose on LSH bucket sizes. |
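
The pseudocode row above describes the core pattern in Stars: rather than scoring all pairs of points that land in the same LSH bucket, a small set of "leader" points is designated per bucket and the remaining members are scored only against those leaders. The sketch below is a minimal, hedged illustration of that pattern under stated assumptions, not the authors' implementation: SimHash stands in for the generic (r₁, r₂, ρ)-sensitive family H, cosine similarity stands in for the generic measure µ, and `num_tables`, `bits`, and `num_leaders` are illustrative placeholders rather than values from the paper.

```python
from collections import defaultdict

import numpy as np


def cosine(u, v):
    """Cosine similarity; stands in for the paper's generic measure mu."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)


def stars_threshold_graph(points, threshold, num_tables=8, bits=12,
                          num_leaders=4, seed=0):
    """Illustrative Stars-style approximate threshold graph (a sketch).

    For each of `num_tables` LSH tables, points are bucketed by a SimHash
    sketch; within each bucket, members are scored only against a few
    sampled leaders (O(num_leaders * |bucket|) comparisons) instead of
    all pairs (O(|bucket|^2)), keeping edges that clear `threshold`.
    """
    rng = np.random.default_rng(seed)
    n, dim = points.shape
    edges = set()
    for _ in range(num_tables):
        planes = rng.standard_normal((bits, dim))
        buckets = defaultdict(list)
        for i in range(n):
            # SimHash sketch: sign pattern of `bits` random projections.
            sketch = tuple(points[i] @ planes.T > 0)
            buckets[sketch].append(i)
        for members in buckets.values():
            leaders = list(rng.permutation(members))[:num_leaders]
            for leader in leaders:
                for j in members:
                    if j != leader and cosine(points[leader], points[j]) >= threshold:
                        edges.add((min(leader, j), max(leader, j)))
    return edges
```

For example, `stars_threshold_graph(np.random.default_rng(1).standard_normal((1000, 64)), threshold=0.6)` builds an approximate 0.6-threshold cosine graph on 1,000 random points; the leader trick is what underlies the reported reduction in pairwise similarity comparisons relative to all-pairs scoring within buckets.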
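The experiment-setup row quotes two implementation details worth making concrete: the leader sample size s = c₂δ⁻¹n^ρ log² n and the limit on LSH bucket sizes. The helpers below are a hedged sketch under stated assumptions: the constant c₂ and the exact cap policy are not pinned down in the quoted text, so `c2=1.0` and the subsampling strategy in `subsample_bucket` are placeholders, not the paper's choices.

```python
import math

import numpy as np


def leader_sample_size(n, rho, delta, c2=1.0):
    """s = c2 * delta^{-1} * n^rho * log^2(n), per the quoted setup.

    c2 is an unspecified constant in the paper; 1.0 is a placeholder.
    """
    return math.ceil((c2 / delta) * (n ** rho) * math.log(n) ** 2)


def subsample_bucket(members, max_size, rng):
    """One plausible reading of the 'limit on LSH bucket sizes': keep at
    most `max_size` uniformly sampled members of an oversized bucket."""
    if len(members) <= max_size:
        return members
    return list(rng.permutation(members)[:max_size])


# Example: n = 1e6 points, rho = 0.25, recall parameter delta = 0.1 gives
# ceil(10 * 1e6**0.25 * log^2(1e6)) ~ 60,000 leaders under the c2 = 1 placeholder.
print(leader_sample_size(10**6, rho=0.25, delta=0.1))
```

A bucket-size cap matters in practice because skewed LSH buckets dominate the comparison cost; subsampling is one simple policy, but the paper's exact handling is not specified in the quoted excerpt.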