SOAR: Improved Indexing for Approximate Nearest Neighbor Search
Authors: Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, Sanjiv Kumar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 experiments. "We benchmark SOAR and achieve state-of-the-art performance, outperforming standard VQ indices, spilled VQ indices trained without SOAR, and all other approaches to ANN search that were submitted to the benchmark." |
| Researcher Affiliation | Industry | Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, and Sanjiv Kumar Google Research {sunphil,dsimcha,ddopson,guorq,sanjivk}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The supplementary materials contain the source code to generate this section's plots. |
| Open Datasets | Yes | The Glove-1M dataset came from ann-benchmarks.com [3], while the Microsoft SPACEV and Microsoft Turing-ANNS datasets came from big-ann-benchmarks.com. |
| Dataset Splits | No | The paper uses benchmark datasets (Glove-1M, Microsoft SPACEV, Microsoft Turing-ANNS) but does not explicitly specify training/validation/test splits via percentages, sample counts, or a reference to a specific split methodology. |
| Hardware Specification | Yes | The SOAR benchmark results came from running on a server using 32 vCPUs (16 physical cores) on an Intel Cascade Lake generation processor with 150GB of memory. The Supermicro SYS-510P-M configured with: 1x Intel® Xeon® Silver 4314 processor (16-core, 2.40 GHz, 24MB cache, 135W); 6x 32GB DDR4-3200 ECC RDIMM server memory (2Rx8 16Gb); 1x 1TB 3.5" MG04ACA 7200 RPM SATA3 6Gb/s 128M-cache 512N hard drive should provide an upper bound for the cost of the SOAR benchmark setup... |
| Software Dependencies | No | The paper mentions tools like ScaNN and FAISS and benchmark platforms like ann-benchmarks.com and big-ann-benchmarks.com, but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific library versions). |
| Experiment Setup | Yes | Glove-1M was trained on an anisotropic loss [8] with 2000 partitions, and SOAR was run with λ = 1. The two billion-datapoint datasets were trained on an anisotropic loss with approximately 7.2 million partitions, and SOAR was run with λ = 1.5. ... The PQ quantization was configured with 16 subspaces and s = 2 dimensions per subspace... |
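The quoted Glove-1M setup can be sketched with ScaNN's index-builder API. This is illustrative only: the `builder`/`tree`/`score_ah`/`reorder` calls follow ScaNN's documented interface, but the `soar_lambda` keyword (for SOAR's λ) and the concrete numeric values besides the 2000 partitions and 2 dimensions per subspace are assumptions, not taken verbatim from the paper's code.

```python
import numpy as np
import scann  # pip install scann

# Hypothetical stand-in dataset; the paper uses Glove-1M from ann-benchmarks.com.
dataset = np.random.rand(1_000_000, 100).astype(np.float32)

# Sketch of an index matching the quoted Glove-1M configuration:
# 2000 partitions trained with anisotropic loss, product quantization with
# s = 2 dimensions per subspace. `soar_lambda=1.0` is an assumed keyword;
# verify it against your ScaNN version's tree() signature.
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100,
          training_sample_size=250_000, soar_lambda=1.0)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search(dataset[0])
```

The `num_leaves_to_search`, `training_sample_size`, reorder depth, and quantization threshold are placeholder values chosen for plausibility, since the paper's quoted setup does not specify them.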