Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Stars: Tera-Scale Graph Building for Clustering and Learning
Authors: CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10~1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2~10-fold improvement in running time without quality loss. ... We then provide an empirical evaluation of Stars in Section 5, where we demonstrate that Stars performs significantly better in practice than even the theoretical bounds suggest. |
| Researcher Affiliation | Industry | CJ Carey Google Research EMAIL Jonathan Halcrow Google Research EMAIL Rajesh Jayaram Google Research EMAIL Vahab Mirrokni Google Research EMAIL Warren Schudy Google Research EMAIL Peilin Zhong Google Research EMAIL |
| Pseudocode | Yes | Stars 1: Constructing Approximate Threshold Graphs Input: Point set P, similarity measure µ, and a (r1, r2, ρ)-sensitive hash family H. ... Stars 2: Constructing Approximate Nearest Neighbor Graphs Input: Point set P, a hash family H, ANN parameters k, ρ, M > 0, recall parameter δ ∈ (0, 1). |
| Open Source Code | No | The paper mentions implementing Stars as part of the Grale [25] system but does not provide any link or explicit statement about making their own code for Stars open source. |
| Open Datasets | Yes | We run experiments on three real public datasets: MNIST [30], Amazon2m [9] (also known as OGBN-Products [28]), and Wikipedia [39], and two synthetic datasets: random1B and random10B. |
| Dataset Splits | No | The paper mentions using specific datasets for evaluation but does not provide explicit training, validation, and test splits (e.g., percentages, sample counts, or references to standard splits) for its own model's training or evaluation. |
| Hardware Specification | No | Our implementation has two primary phases: generating LSH tables using LSH or Sorting LSH, then scoring pairs of points that share a sketch using all-pairs or Stars. ... Each logical unit of computation is automatically distributed across a number of worker machines, with the experiments in this paper scaling to thousands of individual workers. |
| Software Dependencies | No | We have implemented Stars as part of the Grale [25] graph building system using Flume, a C++ counterpart to FlumeJava [13]. The paper does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all algorithms, we use the number of sketches R = 400. ... Set the window size to be W = 16k ... sample s = c_2 δ^{-1} n^ρ log^2 n uniformly random leaders ... An additional important implementation detail is the limit we impose on LSH bucket sizes. |
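To make the pipeline described in the quotes above concrete (generate R LSH sketches, bucket points by sketch, then score pairs that share a bucket, with a cap on bucket sizes), here is a minimal illustrative sketch using plain SimHash bucketing. This is not the authors' Stars implementation; all parameter names (`num_sketches`, `bits`, `max_bucket`) are hypothetical stand-ins for the paper's R, k, and bucket-size limit.

```python
import random
from collections import defaultdict
from itertools import combinations

def simhash_sketch(point, planes):
    """One LSH sketch: the sign pattern of dot products with random hyperplanes."""
    return tuple(sum(p * w for p, w in zip(point, plane)) >= 0 for plane in planes)

def candidate_pairs(points, dim, num_sketches=4, bits=8, max_bucket=64, seed=0):
    """Collect candidate similar pairs by bucketing points under several sketches.

    Buckets larger than `max_bucket` are skipped entirely, mirroring the
    bucket-size limit the paper mentions as an implementation detail.
    """
    rng = random.Random(seed)
    pairs = set()
    for _ in range(num_sketches):
        # Fresh random hyperplanes for each sketch table.
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
        buckets = defaultdict(list)
        for idx, pt in enumerate(points):
            buckets[simhash_sketch(pt, planes)].append(idx)
        # Score only pairs that collide in a (small enough) bucket.
        for members in buckets.values():
            if 1 < len(members) <= max_bucket:
                pairs.update(combinations(members, 2))
    return pairs
```

Points with high cosine similarity land in the same bucket with high probability, so the number of pairwise comparisons is driven by bucket collisions rather than the quadratic all-pairs count, which is the efficiency the paper's 10~1000-fold comparison reductions refer to.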