Stars: Tera-Scale Graph Building for Clustering and Learning

Authors: CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10–1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2–10-fold improvement in running time without quality loss. ... We then provide an empirical evaluation of Stars in Section 5, where we demonstrate that Stars performs significantly better in practice than even the theoretical bounds suggest."
Researcher Affiliation | Industry | CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong (all Google Research; cjcarey@google.com, halcrow@google.com, rkjayaram@google.com, mirrokni@google.com, wschudy@google.com, peilinz@google.com)
Pseudocode | Yes | "Stars 1: Constructing Approximate Threshold Graphs. Input: point set P, similarity measure µ, and an (r1, r2, ρ)-sensitive hash family H. ... Stars 2: Constructing Approximate Nearest Neighbor Graphs. Input: point set P, a hash family H, ANN parameters k, ρ, M > 0, recall parameter δ ∈ (0, 1)."
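The pseudocode excerpt above combines LSH bucketing with star contractions: instead of scoring every pair inside a bucket, points are scored only against a small set of sampled leaders. Below is a minimal single-machine sketch of that idea, assuming a point list `points`, a similarity callable `mu`, and a list of LSH hash callables `hash_fns`; all names and the flat thresholding structure are illustrative, not the authors' implementation.

```python
import random
from collections import defaultdict

def stars_threshold_graph(points, mu, hash_fns, num_leaders, threshold, seed=0):
    """Approximate threshold graph: within each LSH bucket, score only
    (point, leader) pairs instead of all pairs in the bucket."""
    rng = random.Random(seed)
    leaders = set(rng.sample(range(len(points)), num_leaders))
    edges = set()
    for h in hash_fns:                        # one LSH table per hash function
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            buckets[h(p)].append(i)
        for bucket in buckets.values():
            bucket_leaders = [i for i in bucket if i in leaders]
            for i in bucket:
                for j in bucket_leaders:      # compare to leaders only
                    if i != j and mu(points[i], points[j]) >= threshold:
                        edges.add((min(i, j), max(i, j)))
    return edges
```

Restricting comparisons to point-leader pairs is what replaces the quadratic all-pairs cost per bucket with a cost linear in the bucket size times the number of co-bucketed leaders.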
Open Source Code | No | The paper mentions implementing Stars as part of the Grale [25] system but provides no link to, or explicit statement about, an open-source release of the Stars code itself.
Open Datasets | Yes | "We run experiments on three real public datasets: MNIST [30], Amazon2m [9] (also known as OGBN-Products [28]), and Wikipedia [39], and two synthetic datasets: random1B and random10B."
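For the two synthetic datasets, the excerpt does not state the generating distribution or dimensionality; the sketch below is a hypothetical generator assuming i.i.d. Gaussian vectors, with `dim` chosen arbitrarily.

```python
import numpy as np

def random_dataset(n, dim=128, seed=0):
    # Hypothetical random1B-style generator; distribution and dimension
    # are assumptions, not taken from the paper.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, dim)).astype(np.float32)

sample = random_dataset(10_000)  # scale n toward 1e9 / 1e10 for random1B / random10B
```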
Dataset Splits | No | The paper names the datasets used for evaluation but does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or references to standard splits).
Hardware Specification | No | "Our implementation has two primary phases: generating LSH tables using LSH or Sorting LSH, then scoring pairs of points that share a sketch using all-pairs or Stars. ... Each logical unit of computation is automatically distributed across a number of worker machines, with the experiments in this paper scaling to thousands of individual workers." No machine types, CPU/RAM specifications, or exact worker counts are given.
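The quoted description implies a two-stage data flow: phase 1 builds LSH tables, phase 2 scores pairs of points that share a sketch. The sketch below is a single-machine analogue of that flow using the all-pairs scoring variant; it only mirrors the data movement, not the distributed Flume execution, and Sorting LSH is not shown.

```python
from collections import defaultdict
from itertools import combinations

def build_lsh_tables(points, hash_fns):
    """Phase 1: map each point to (table id, sketch) keys."""
    tables = defaultdict(list)
    for i, p in enumerate(points):
        for t, h in enumerate(hash_fns):
            tables[(t, h(p))].append(i)
    return tables

def score_cobucketed_pairs(tables, points, mu):
    """Phase 2 (all-pairs variant): score every pair sharing a sketch."""
    scores = {}
    for bucket in tables.values():
        for i, j in combinations(sorted(bucket), 2):
            scores[(i, j)] = mu(points[i], points[j])
    return scores
```

In the distributed setting each `tables` key naturally becomes a shuffle key, which is why phase 2 parallelizes cleanly across workers.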
Software Dependencies | No | "We have implemented Stars as part of the Grale [25] graph building system using Flume, a C++ counterpart to FlumeJava [13]." The paper does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | "For all algorithms, we use the number of sketches R = 400. ... Set the window size to be W = 16k ... sample s = c_2 δ^{-1} n^ρ log^2 n uniformly random leaders ... An additional important implementation detail is the limit we impose on LSH bucket sizes."
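Putting the reported choices together, a hypothetical helper might assemble the hyperparameters as below. The constant c_2, the bucket-size cap value, and the base of the logarithm are not given in the excerpt and are placeholders here; k is assumed to be the ANN parameter from Stars 2.

```python
import math

def stars_hyperparams(n, rho, delta, k, c2=1.0, bucket_cap=1000):
    """Reported choices: R = 400 sketches, window W = 16k, and
    s = c_2 * delta^{-1} * n^rho * log^2(n) sampled leaders.
    c2 and bucket_cap are placeholders not specified in the excerpt."""
    R = 400
    W = 16 * k
    num_leaders = math.ceil(c2 / delta * n**rho * math.log(n) ** 2)
    return {"R": R, "W": W, "num_leaders": num_leaders, "bucket_cap": bucket_cap}

print(stars_hyperparams(n=1_000_000, rho=0.5, delta=0.1, k=10))
```

The n^ρ factor in the leader sample size matches the LSH sensitivity parameter ρ from the hash family, and the bucket cap bounds the per-bucket scoring cost regardless of how skewed the hash buckets are.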