SOAR: Improved Indexing for Approximate Nearest Neighbor Search

Authors: Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, Sanjiv Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 experiments. "We benchmark SOAR and achieve state-of-the-art performance, outperforming standard VQ indices, spilled VQ indices trained without SOAR, and all other approaches to ANN search that were submitted to the benchmark."
Researcher Affiliation | Industry | "Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, and Sanjiv Kumar. Google Research. {sunphil,dsimcha,ddopson,guorq,sanjivk}@google.com"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. (A hedged sketch of the spilled-assignment rule, reconstructed from the paper's prose, appears after the table.)
Open Source Code | Yes | "The supplementary materials contain the source code to generate this section's plots."
Open Datasets | Yes | "The Glove-1M dataset came from ann-benchmarks.com [3], while the Microsoft SPACEV and Microsoft Turing-ANNS datasets came from big-ann-benchmarks.com."
Dataset Splits | No | The paper uses the benchmark datasets Glove-1M, Microsoft SPACEV, and Microsoft Turing-ANNS, but never states train/validation/test splits as percentages, sample counts, or a reference to a split methodology. (The benchmark files do ship with fixed splits; see the loading sketch after the table.)
Hardware Specification | Yes | "The SOAR benchmark results came from running on a server using 32 vCPUs (16 physical cores) on an Intel Cascade Lake generation processor with 150GB of memory. The Supermicro SYS-510P-M configured with: 1x Intel® Xeon® Silver 4314 processor (16-core, 2.40GHz, 24MB cache, 135W); 6x 32GB DDR4-3200MHz ECC RDIMM server memory (2Rx8 16Gb); 1x 1TB 3.5" MG04ACA 7200RPM SATA3 6Gb/s 128M-cache 512N hard drive should provide an upper bound for the cost of the SOAR benchmark setup..."
Software Dependencies | No | The paper mentions tools like ScaNN and FAISS and the benchmark platforms ann-benchmarks.com and big-ann-benchmarks.com, but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific library versions).
Experiment Setup | Yes | "Glove-1M was trained on an anisotropic loss [8] with 2000 partitions, and SOAR was run with λ = 1. The two billion-datapoint datasets were trained on an anisotropic loss with approximately 7.2 million partitions, and SOAR was run with λ = 1.5. ... The PQ quantization was configured with 16 subspaces and s = 2 dimensions per subspace..." (A configuration sketch mirroring the Glove-1M settings appears after the table.)
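
The paper itself ships no pseudocode, but the spilled-assignment rule it describes is short enough to sketch. Below is a minimal NumPy illustration, assuming SOAR scores each candidate secondary center c by ||r'||^2 + λ⟨r', r̂⟩^2, where r = x − c₁ is the primary residual, r̂ its unit vector, and r' = x − c; the function and variable names are illustrative, not the paper's.

    import numpy as np

    def soar_assign(x, centroids, lam=1.0):
        """Sketch of SOAR's spilled assignment for one datapoint x.

        Assumed scoring: the secondary center pays the usual squared
        residual norm plus a lambda-weighted penalty on the component
        of its residual parallel to the primary residual.
        """
        # Primary assignment: nearest centroid by Euclidean distance.
        primary = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

        # Unit vector along the primary residual r = x - c1.
        r = x - centroids[primary]
        r_hat = r / (np.linalg.norm(r) + 1e-12)

        # Assumed SOAR score per candidate c:
        #   ||x - c||^2 + lam * <x - c, r_hat>^2
        resid = x - centroids                 # r' for every candidate center
        parallel = resid @ r_hat              # component of r' along r_hat
        scores = (resid ** 2).sum(axis=1) + lam * parallel ** 2
        scores[primary] = np.inf              # spill to a *different* center
        return primary, int(np.argmin(scores))

With λ = 0 this degenerates to plain two-way spilling; the λ = 1 setting from the Glove-1M row amplifies the penalty on redundant (parallel) secondary residuals.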
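
On the splits question specifically: the benchmark files carry a fixed, implicit split rather than one chosen by the authors. Assuming the standard ann-benchmarks HDF5 layout (keys 'train', 'test', 'neighbors', 'distances'), the split can be inspected directly; the file name below is the Glove angular dataset as hosted by ann-benchmarks.com.

    import h5py

    # Assumed ann-benchmarks layout: 'train' holds the base vectors to
    # index, 'test' holds the queries, and 'neighbors'/'distances' hold
    # the ground-truth answers for each query.
    with h5py.File("glove-100-angular.hdf5", "r") as f:
        base = f["train"][:]        # vectors to build the index over
        queries = f["test"][:]      # held-out query vectors
        gt = f["neighbors"][:]      # true nearest-neighbor ids per query
        print(base.shape, queries.shape, gt.shape)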
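
The experiment-setup parameters also map naturally onto ScaNN's public builder API. The sketch below mirrors the Glove-1M row (2000 partitions, dot-product distance, 2 dimensions per AH subspace) using the builder pattern from ScaNN's published examples; the num_leaves_to_search, training_sample_size, threshold, and reorder values are placeholders taken from those examples, and the SOAR spilling option itself is deliberately omitted because its released parameter name is not given here.

    import numpy as np
    import scann

    # Hypothetical input: float32 base vectors, unit-normalized so that
    # dot product matches the angular Glove-1M setup.
    dataset = np.load("glove_base.npy")  # placeholder file name
    dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

    searcher = (
        scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
        # 2000 partitions, matching "2000 partitions" above.
        .tree(num_leaves=2000, num_leaves_to_search=100,
              training_sample_size=250_000)
        # Anisotropic (AH) scoring with 2 dimensions per subspace; the
        # resulting subspace count follows from the data dimensionality,
        # so it will not be exactly the 16 quoted for the PQ config.
        .score_ah(2, anisotropic_quantization_threshold=0.2)
        .reorder(100)
        .build()
    )
    # SOAR spilling with lambda = 1 would be enabled in this builder
    # chain; the exact option name is not stated in the table, so this
    # sketch leaves it out rather than guess.

    neighbors, distances = searcher.search_batched(dataset[:5])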