DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Authors: Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, Rohan Kadekodi
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now compare Vamana with other relevant algorithms for approximate nearest neighbor search. First, for in-memory search, we compare our algorithm with NSG [13] and HNSW [21], which offer best-in-class latency vs recall on most public benchmark datasets. Next, for large billion point datasets, we compare DiskANN with compression-based techniques such as FAISS [18] and IVFOADC+G+P [8]. |
| Researcher Affiliation | Collaboration | Suhas Jayaram Subramanya, Carnegie Mellon University, suhas@cmu.edu; Devvrit, University of Texas at Austin, devvrit.03@gmail.com; Rohan Kadekodi, University of Texas at Austin, rak@cs.texas.edu; Ravishankar Krishnaswamy, Microsoft Research India, rakri@microsoft.com; Harsha Vardhan Simhadri, Microsoft Research India, harshasi@microsoft.com |
| Pseudocode | Yes | Algorithm 1: GreedySearch(s, x_q, k, L) ... Algorithm 2: RobustPrune(p, V, α, R) ... Algorithm 3: Vamana Indexing algorithm (a hedged Python sketch of the first two routines follows the table) |
| Open Source Code | No | The paper refers to existing open-source code for other methods (e.g., the NSG repository in footnote 5, IVFOADC+G+P in Section 4.4) but does not provide an explicit statement or link for the source code of DiskANN or Vamana developed in this paper. |
| Open Datasets | Yes | We compared Vamana with HNSW and NSG on three commonly used public benchmarks: SIFT1M (128 dimensions) and GIST1M (960 dimensions), both of which are million point datasets of image descriptors [1], and DEEP1M (96 dimensions), a random one million size sample of DEEP1B, a machine-learned set of one billion vectors [6] (a reader for the .fvecs format these datasets use is sketched after the table). |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test splits, only mentioning datasets used for evaluation. |
| Hardware Specification | Yes | We use the following two machines for all experiments. z840: a bare-metal mid-range workstation with dual Xeon E5-2620v4s (16 cores), 64GB DDR4 RAM, and two Samsung 960 EVO 1TB SSDs in RAID-0 configuration. M64-32ms: a virtual machine with dual Xeon E7-8890v3s (32 vCPUs) with 1792GB DDR3 RAM that we use to build a one-shot in-memory index for billion point datasets. |
| Software Dependencies | No | The paper mentions various algorithms and systems like FAISS, HNSW, NSG, IVFOADC+G+P, and Product Quantization, and refers to their open-source code. However, it does not specify software dependencies with version numbers for their own implementation (e.g., Python 3.x, PyTorch 1.x, CUDA 1x.x). |
| Experiment Setup | Yes | For all three algorithms, we did a parameter sweep and selected near-optimal choice of parameters for the best recall vs latency trade-off. All HNSW indices were constructed using M = 128, efC = 512, while Vamana indices used L = 125, R = 70, C = 3000, α = 2. For NSG on SIFT1M and GIST1M, we use the parameters listed on their repository due to their excellent performance, and used R = 60, L = 70, C = 500 for DEEP1M (a hedged HNSW build sketch with these parameters follows the table). |
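
The routines named in the Pseudocode row appear in the paper only as pseudocode. The following is a minimal Python sketch of Algorithm 1 (GreedySearch) and Algorithm 2 (RobustPrune) for orientation: it assumes the graph is stored as an adjacency dict over integer node ids with Euclidean distance, and it is an illustrative reconstruction, not the authors' implementation. Roughly speaking, Algorithm 3 (Vamana) alternates these two routines over the dataset, first with α = 1 and then with the user-chosen α.

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

def greedy_search(graph, points, s, x_q, k, L):
    """Algorithm 1 (GreedySearch): beam search of width L from start node s.

    graph  : dict mapping node id -> list of out-neighbor ids
    points : array of data vectors indexed by node id
    Returns (k closest ids found, set of all visited ids).
    """
    candidates = {s}
    visited = set()
    while candidates - visited:
        # expand the closest not-yet-visited candidate
        p = min(candidates - visited, key=lambda i: dist(points[i], x_q))
        visited.add(p)
        candidates.update(graph[p])
        # keep only the L candidates closest to the query
        if len(candidates) > L:
            candidates = set(sorted(candidates,
                                    key=lambda i: dist(points[i], x_q))[:L])
    top_k = sorted(candidates, key=lambda i: dist(points[i], x_q))[:k]
    return top_k, visited

def robust_prune(graph, points, p, V, alpha, R):
    """Algorithm 2 (RobustPrune): prune candidate set V down to at most R
    out-neighbors of p, keeping only edges that remain alpha-diverse."""
    V = (set(V) | set(graph[p])) - {p}
    graph[p] = []
    while V:
        p_star = min(V, key=lambda v: dist(points[v], points[p]))
        graph[p].append(p_star)
        if len(graph[p]) == R:
            break
        # discard v if routing through p_star already covers it:
        # alpha * d(p_star, v) <= d(p, v)
        V = {v for v in V
             if alpha * dist(points[p_star], points[v]) > dist(points[p], points[v])}
```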
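
The SIFT1M and GIST1M benchmarks referenced in the Open Datasets row are distributed in the .fvecs format, where each vector is stored as a little-endian int32 dimension followed by that many float32 components. A small NumPy reader along those lines is sketched below; the file paths in the usage comment are placeholders.

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file into an (n, d) float32 array.

    Each record is an int32 dimension d followed by d float32 values.
    """
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]            # first field of the first record
    return raw.reshape(-1, d + 1)[:, 1:].copy()

# Usage (paths are placeholders):
# base = read_fvecs("sift/sift_base.fvecs")      # e.g. (1_000_000, 128)
# queries = read_fvecs("sift/sift_query.fvecs")
```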
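
For the Experiment Setup row, the HNSW parameters (M = 128, efC = 512) can be exercised with any HNSW implementation; the sketch below uses the hnswlib Python bindings as an assumed stand-in, since the excerpt does not state which implementation the authors used, and the Vamana parameters (L, R, C, α) have no counterpart in this library.

```python
import hnswlib  # assumed HNSW implementation, not named in the excerpt above

def build_hnsw(base, M=128, ef_construction=512, num_threads=16):
    """Build an HNSW index with the baseline parameters reported in the table."""
    n, dim = base.shape
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=n, ef_construction=ef_construction, M=M)
    index.add_items(base, num_threads=num_threads)
    return index

def query_hnsw(index, queries, k=10, ef_search=128):
    """Search with beam width ef; sweeping ef traces the recall vs latency curve."""
    index.set_ef(ef_search)
    labels, distances = index.knn_query(queries, k=k)
    return labels, distances
```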