DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Authors: Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, Rohan Kadekodi

NeurIPS 2019

Reproducibility assessment: each variable below is followed by its result and the supporting LLM response.
Research Type: Experimental. LLM response: "We now compare Vamana with other relevant algorithms for approximate nearest neighbor search. First, for in-memory search, we compare our algorithm with NSG [13] and HNSW [21], which offer best-in-class latency vs. recall on most public benchmark datasets. Next, for large billion-point datasets, we compare DiskANN with compression-based techniques such as FAISS [18] and IVFOADC+G+P [8]."
Researcher Affiliation: Collaboration. LLM response: "Suhas Jayaram Subramanya (Carnegie Mellon University, suhas@cmu.edu); Devvrit (University of Texas at Austin, devvrit.03@gmail.com); Rohan Kadekodi (University of Texas at Austin, rak@cs.texas.edu); Ravishankar Krishnaswamy (Microsoft Research India, rakri@microsoft.com); Harsha Vardhan Simhadri (Microsoft Research India, harshasi@microsoft.com)"
Pseudocode: Yes. LLM response: Algorithm 1: GreedySearch(s, x_q, k, L) ... Algorithm 2: RobustPrune(p, V, α, R) ... Algorithm 3: Vamana indexing algorithm.
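The row above only names the paper's three algorithms. For orientation, here is a minimal Python sketch of the greedy best-first search pattern that Algorithm 1 follows (a paraphrase of the general technique, not the authors' code; the graph representation, distance function, and variable names are assumptions):

```python
import heapq
import math

def greedy_search(graph, points, s, x_q, k, L):
    """Sketch of greedy best-first graph search (in the spirit of Algorithm 1).

    graph:  dict mapping node id -> list of neighbor ids
    points: dict mapping node id -> coordinate tuple
    s: start node, x_q: query vector, k: results wanted, L: search-list size.
    Returns (k closest nodes found, set of visited nodes).
    """
    def dist(p):
        return math.dist(points[p], x_q)

    candidates = {s}   # current search list, capped at the L closest nodes
    visited = set()
    while candidates - visited:
        # expand the closest not-yet-visited candidate
        p = min(candidates - visited, key=dist)
        visited.add(p)
        candidates.update(graph[p])
        # retain only the L closest candidates seen so far
        if len(candidates) > L:
            candidates = set(heapq.nsmallest(L, candidates, key=dist))
    return sorted(candidates, key=dist)[:k], visited
```

The visited set matters in Vamana because, per the paper, it feeds RobustPrune during index construction; this sketch only illustrates the search side.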
Open Source Code: No. LLM response: The paper refers to existing open-source code for other methods (e.g., the NSG repository in footnote 5, IVFOADC+G+P in Section 4.4) but does not provide an explicit statement or link for the source code of DiskANN or Vamana developed in this paper.
Open Datasets: Yes. LLM response: "We compared Vamana with HNSW and NSG on three commonly used public benchmarks: SIFT1M (128-dimensions) and GIST1M (960-dimensions), both of which are million point datasets of image descriptors [1], and DEEP1M (96-dimensions), a random one million size sample of DEEP1B, a machine-learned set of one billion vectors [6]."
Dataset Splits: No. LLM response: The paper does not explicitly provide train/validation/test splits; the datasets are mentioned only as evaluation benchmarks.
Hardware Specification: Yes. LLM response: "We use the following two machines for all experiments. z840: a bare-metal mid-range workstation with dual Xeon E5-2620v4s (16 cores), 64GB DDR4 RAM, and two Samsung 960 EVO 1TB SSDs in RAID-0 configuration. M64-32ms: a virtual machine with dual Xeon E7-8890v3s (32 vCPUs) and 1792GB DDR3 RAM that we use to build a one-shot in-memory index for billion-point datasets."
Software Dependencies: No. LLM response: The paper mentions various algorithms and systems, such as FAISS, HNSW, NSG, IVFOADC+G+P, and Product Quantization, and refers to their open-source code. However, it does not specify software dependencies with version numbers for its own implementation (e.g., Python 3.x, PyTorch 1.x, CUDA 10.x).
Experiment Setup: Yes. LLM response: "For all three algorithms, we did a parameter sweep and selected a near-optimal choice of parameters for the best recall vs. latency trade-off. All HNSW indices were constructed using M = 128, efC = 512, while Vamana indices used L = 125, R = 70, C = 3000, α = 2. For NSG on SIFT1M and GIST1M, we use the parameters listed on their repository due to their excellent performance, and used R = 60, L = 70, C = 500 for DEEP1M."
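For quick reference, the build parameters quoted above can be collected in one place. The values below are transcribed from the row above; the dictionary layout and key names are my own and do not come from the authors' code:

```python
# Index-construction parameters reported in the paper's comparison.
# Layout and key names are illustrative only.
build_params = {
    "HNSW":   {"M": 128, "efC": 512},                      # all three datasets
    "Vamana": {"L": 125, "R": 70, "C": 3000, "alpha": 2},  # all three datasets
    "NSG": {
        # SIFT1M and GIST1M use the parameters listed on the NSG repository;
        # only the DEEP1M settings are stated explicitly in the paper.
        "DEEP1M": {"R": 60, "L": 70, "C": 500},
    },
}
```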