Stars: Tera-Scale Graph Building for Clustering and Learning

Authors: CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10–1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2–10-fold improvement in running time without quality loss. ... We then provide an empirical evaluation of Stars in Section 5, where we demonstrate that Stars performs significantly better in practice than even the theoretical bounds suggest."
Researcher Affiliation | Industry | CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong (all Google Research; cjcarey@google.com, halcrow@google.com, rkjayaram@google.com, mirrokni@google.com, wschudy@google.com, peilinz@google.com)
Pseudocode | Yes | "Stars 1: Constructing Approximate Threshold Graphs. Input: point set P, similarity measure µ, and an (r1, r2, ρ)-sensitive hash family H. ... Stars 2: Constructing Approximate Nearest Neighbor Graphs. Input: point set P, a hash family H, ANN parameters k, ρ, M > 0, recall parameter δ ∈ (0, 1)."
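The pseudocode excerpt above combines LSH bucketing with star contractions: instead of scoring every pair inside a bucket, points are scored only against a small set of sampled leaders. Below is a minimal single-machine sketch of that idea, assuming a point list `points`, a similarity callable `mu`, and a list of LSH hash callables `hash_fns`; all names and the flat thresholding structure are illustrative, not the authors' implementation.

```python
import random
from collections import defaultdict

def stars_threshold_graph(points, mu, hash_fns, num_leaders, threshold, seed=0):
    """Approximate threshold graph: within each LSH bucket, score only
    (point, leader) pairs instead of all pairs in the bucket."""
    rng = random.Random(seed)
    leaders = set(rng.sample(range(len(points)), num_leaders))
    edges = set()
    for h in hash_fns:                        # one LSH table per hash function
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            buckets[h(p)].append(i)
        for bucket in buckets.values():
            bucket_leaders = [i for i in bucket if i in leaders]
            for i in bucket:
                for j in bucket_leaders:      # compare to leaders only
                    if i != j and mu(points[i], points[j]) >= threshold:
                        edges.add((min(i, j), max(i, j)))
    return edges
```

Restricting comparisons to point-leader pairs is what replaces the quadratic all-pairs cost per bucket with a cost linear in the bucket size times the number of co-bucketed leaders.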
Open Source Code | No | The paper mentions implementing Stars as part of the Grale [25] system but provides no link to, or explicit statement about, an open-source release of the Stars code itself.
Open Datasets | Yes | "We run experiments on three real public datasets: MNIST [30], Amazon2m [9] (also known as OGBN-Products [28]), and Wikipedia [39], and two synthetic datasets: random1B and random10B."
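For the two synthetic datasets, the excerpt does not state the generating distribution or dimensionality; the sketch below is a hypothetical generator assuming i.i.d. Gaussian vectors, with `dim` chosen arbitrarily.

```python
import numpy as np

def random_dataset(n, dim=128, seed=0):
    # Hypothetical random1B-style generator; distribution and dimension
    # are assumptions, not taken from the paper.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, dim)).astype(np.float32)

sample = random_dataset(10_000)  # scale n toward 1e9 / 1e10 for random1B / random10B
```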
Dataset Splits | No | The paper names the datasets used for evaluation but does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or references to standard splits).
Hardware Specification | No | "Our implementation has two primary phases: generating LSH tables using LSH or Sorting LSH, then scoring pairs of points that share a sketch using all-pairs or Stars. ... Each logical unit of computation is automatically distributed across a number of worker machines, with the experiments in this paper scaling to thousands of individual workers." No machine types, CPU/RAM specifications, or exact worker counts are given.
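The quoted description implies a two-stage data flow: phase 1 builds LSH tables, phase 2 scores pairs of points that share a sketch. The sketch below is a single-machine analogue of that flow using the all-pairs scoring variant; it only mirrors the data movement, not the distributed Flume execution, and Sorting LSH is not shown.

```python
from collections import defaultdict
from itertools import combinations

def build_lsh_tables(points, hash_fns):
    """Phase 1: map each point to (table id, sketch) keys."""
    tables = defaultdict(list)
    for i, p in enumerate(points):
        for t, h in enumerate(hash_fns):
            tables[(t, h(p))].append(i)
    return tables

def score_cobucketed_pairs(tables, points, mu):
    """Phase 2 (all-pairs variant): score every pair sharing a sketch."""
    scores = {}
    for bucket in tables.values():
        for i, j in combinations(sorted(bucket), 2):
            scores[(i, j)] = mu(points[i], points[j])
    return scores
```

In the distributed setting each `tables` key naturally becomes a shuffle key, which is why phase 2 parallelizes cleanly across workers.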
Software Dependencies | No | "We have implemented Stars as part of the Grale [25] graph building system using Flume, a C++ counterpart to FlumeJava [13]." The paper does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | "For all algorithms, we use the number of sketches R = 400. ... Set the window size to be W = 16k ... sample s = c_2 δ^{-1} n^ρ log^2 n uniformly random leaders ... An additional important implementation detail is the limit we impose on LSH bucket sizes."
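Putting the reported choices together, a hypothetical helper might assemble the hyperparameters as below. The constant c_2, the bucket-size cap value, and the base of the logarithm are not given in the excerpt and are placeholders here; k is assumed to be the ANN parameter from Stars 2.

```python
import math

def stars_hyperparams(n, rho, delta, k, c2=1.0, bucket_cap=1000):
    """Reported choices: R = 400 sketches, window W = 16k, and
    s = c_2 * delta^{-1} * n^rho * log^2(n) sampled leaders.
    c2 and bucket_cap are placeholders not specified in the excerpt."""
    R = 400
    W = 16 * k
    num_leaders = math.ceil(c2 / delta * n**rho * math.log(n) ** 2)
    return {"R": R, "W": W, "num_leaders": num_leaders, "bucket_cap": bucket_cap}

print(stars_hyperparams(n=1_000_000, rho=0.5, delta=0.1, k=10))
```

The n^ρ factor in the leader sample size matches the LSH sensitivity parameter ρ from the hash family, and the bucket cap bounds the per-bucket scoring cost regardless of how skewed the hash buckets are.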