Graph Reordering for Cache-Efficient Near Neighbor Search

Authors: Benjamin Coleman, Santiago Segarra, Alexander J. Smola, Anshumali Shrivastava

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present exhaustive experiments applying several reordering algorithms to a leading graph-based near neighbor method based on the HNSW index. We find that reordering improves the query time by up to 40%, we present analysis and improvements for existing graph layout methods, and we demonstrate that the time needed to reorder the graph is negligible compared to the time required to construct the index." and "We present a rigorous evaluation where we integrate graph reordering into HNSW. We perform an exhaustive comparison on large embedding datasets where we benchmark the query time and cache miss rate."
Researcher Affiliation | Collaboration | Benjamin Coleman, ECE Department, Rice University, Houston, TX 77005, benjamin.ray.coleman@gmail.com; Santiago Segarra, ECE Department, Rice University, Houston, TX 77005, segarra@rice.edu; Alex Smola, Amazon Web Services, alex@smola.org; Anshumali Shrivastava, Department of Computer Science, Rice University, Houston, TX 77005, anshumali@rice.edu
Pseudocode | Yes | "We provide pseudocode in the supplementary materials and an application-agnostic implementation in the code." (See the illustrative reordering sketch after the table.)
Open Source Code | Yes | "We include code, which is not yet released under any license." and "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]"
Open Datasets | Yes | "We use large datasets that are representative of embedding search tasks. We perform experiments on datasets from ANN-benchmarks as well as on 10M and 100M sized subsets of the SIFT1B and DEEP1B benchmark tasks." and "We cite ANN-benchmarks for the datasets." and "ANN-benchmarks is released under the MIT license."
Dataset Splits | No | The paper mentions using 'datasets from ANN-benchmarks' and discusses '10K queries' and '1K queries from the test set' for profiling, but it does not provide specific training, validation, or test dataset split percentages or absolute sample counts for reproducibility.
Hardware Specification | Yes | "We run all of our experiments on a server with 252 GB of RAM, 28 Intel Xeon E5-2697 CPUs, and a shared 36 MB L3 cache."
Software Dependencies | No | The paper mentions software such as the 'nmslib implementation of HNSW,' the 'Linux perf tool,' and 'cachegrind,' and states that the implementation is in C++, but it does not provide specific version numbers for any of these software dependencies. (A hedged perf-based profiling sketch follows the table.)
Experiment Setup | Yes | "For construction, HNSW requires two parameters: the maximum number of edges for each node (kc) and the size of the beam search buffer (Mc) used to find the kc neighbors during graph construction. We set Mc = 100 and construct indices for kc ∈ {4, 8, 16, 32, 64, 96}." and "To query the index, HNSW requires the beam search buffer size (Mq), which controls the trade-off between recall and query latency. ... We use window size w = 5 for Gorder and use the standard hyperparameter settings suggested by the literature for the other methods." (See the HNSW parameter sketch after the table.)
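The pseudocode row above points to an application-agnostic reordering implementation in the released code. As a point of reference only, the sketch below shows what such an interface can look like: it computes a node permutation and rewrites an adjacency list so that storage order follows the new labels. The degree-descending rule and the function names are illustrative placeholders, not the paper's pseudocode or its evaluated layout methods (such as Gorder).

```python
# Illustrative sketch of an application-agnostic graph reordering interface.
# This is NOT the paper's pseudocode: nodes are relabeled by descending degree
# purely to show how a permutation is produced and then applied to the graph.

def degree_order(adjacency):
    """Return new_id[old_id], a permutation that sorts nodes by descending degree."""
    order = sorted(range(len(adjacency)), key=lambda v: len(adjacency[v]), reverse=True)
    new_id = [0] * len(adjacency)
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id

def apply_permutation(adjacency, new_id):
    """Rewrite the adjacency list so node ids (and storage order) follow new_id."""
    reordered = [None] * len(adjacency)
    for old, neighbors in enumerate(adjacency):
        reordered[new_id[old]] = sorted(new_id[v] for v in neighbors)
    return reordered

# Toy usage: nodes that are traversed together end up with nearby ids, which is
# the locality property that graph reordering exploits during beam search.
graph = [[1, 2], [0, 2, 3], [0, 1], [1]]
print(apply_permutation(graph, degree_order(graph)))
```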
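The software-dependencies row names the Linux perf tool and cachegrind for cache profiling, but no versions or invocations are given. The snippet below is one plausible way to collect cache-miss counters for a query benchmark with `perf stat`; the executable name `./query_benchmark` and the choice of events are assumptions, not details taken from the paper.

```python
# Hedged sketch: wrap a (hypothetical) query benchmark binary with Linux perf
# to count cache references and misses. Requires perf to be installed and the
# placeholder executable ./query_benchmark to exist.
import subprocess

result = subprocess.run(
    ["perf", "stat", "-e", "cache-references,cache-misses", "--", "./query_benchmark"],
    capture_output=True,
    text=True,
)
# perf stat writes its counter summary to stderr, not stdout.
print(result.stderr)
```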
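The experiment-setup row lists HNSW's construction parameters (kc, Mc) and query parameter (Mq). The sketch below shows how those roles map onto nmslib's Python bindings under our reading of the notation, i.e. kc → M, Mc → efConstruction, and Mq → efSearch; this mapping and the random placeholder data are assumptions, and the paper itself uses the C++ nmslib implementation.

```python
# Hedged sketch: building and querying an HNSW index with nmslib's Python API.
# Parameter mapping (our assumption): kc -> "M", Mc -> "efConstruction", Mq -> "efSearch".
import numpy as np
import nmslib

data = np.random.rand(10000, 128).astype(np.float32)      # placeholder base vectors
queries = np.random.rand(100, 128).astype(np.float32)     # placeholder query vectors

index = nmslib.init(method="hnsw", space="l2")
index.addDataPointBatch(data)
index.createIndex({"M": 16, "efConstruction": 100})        # kc = 16, Mc = 100
index.setQueryTimeParams({"efSearch": 200})                # Mq: larger = higher recall, slower queries
neighbors = index.knnQueryBatch(queries, k=10, num_threads=1)
```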