SOAR: Improved Indexing for Approximate Nearest Neighbor Search

Authors: Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, Sanjiv Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 experiments. "We benchmark SOAR and achieve state-of-the-art performance, outperforming standard VQ indices, spilled VQ indices trained without SOAR, and all other approaches to ANN search that were submitted to the benchmark."
Researcher Affiliation | Industry | "Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, and Sanjiv Kumar. Google Research. {sunphil,dsimcha,ddopson,guorq,sanjivk}@google.com"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. (A hedged sketch of the spilled-assignment rule, reconstructed from the paper's prose, appears after the table.)
Open Source Code | Yes | "The supplementary materials contain the source code to generate this section's plots."
Open Datasets | Yes | "The Glove-1M dataset came from ann-benchmarks.com [3], while the Microsoft SPACEV and Microsoft Turing-ANNS datasets came from big-ann-benchmarks.com."
Dataset Splits | No | The paper uses the benchmark datasets Glove-1M, Microsoft SPACEV, and Microsoft Turing-ANNS, but never states train/validation/test splits as percentages, sample counts, or a reference to a split methodology. (The benchmark files do ship with fixed splits; see the loading sketch after the table.)
Hardware Specification | Yes | "The SOAR benchmark results came from running on a server using 32 vCPUs (16 physical cores) on an Intel Cascade Lake generation processor with 150GB of memory. The Supermicro SYS-510P-M configured with: 1x Intel® Xeon® Silver 4314 processor (16-core, 2.40GHz, 24MB cache, 135W); 6x 32GB DDR4-3200MHz ECC RDIMM server memory (2Rx8 16Gb); 1x 1TB 3.5" MG04ACA 7200RPM SATA3 6Gb/s 128M-cache 512N hard drive should provide an upper bound for the cost of the SOAR benchmark setup..."
Software Dependencies | No | The paper mentions tools like ScaNN and FAISS and the benchmark platforms ann-benchmarks.com and big-ann-benchmarks.com, but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific library versions).
Experiment Setup | Yes | "Glove-1M was trained on an anisotropic loss [8] with 2000 partitions, and SOAR was run with λ = 1. The two billion-datapoint datasets were trained on an anisotropic loss with approximately 7.2 million partitions, and SOAR was run with λ = 1.5. ... The PQ quantization was configured with 16 subspaces and s = 2 dimensions per subspace..." (A configuration sketch mirroring the Glove-1M settings appears after the table.)
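
The paper itself ships no pseudocode, but the spilled-assignment rule it describes is short enough to sketch. Below is a minimal NumPy illustration, assuming SOAR scores each candidate secondary center c by ||r'||^2 + λ⟨r', r̂⟩^2, where r = x − c₁ is the primary residual, r̂ its unit vector, and r' = x − c; the function and variable names are illustrative, not the paper's.

    import numpy as np

    def soar_assign(x, centroids, lam=1.0):
        """Sketch of SOAR's spilled assignment for one datapoint x.

        Assumed scoring: the secondary center pays the usual squared
        residual norm plus a lambda-weighted penalty on the component
        of its residual parallel to the primary residual.
        """
        # Primary assignment: nearest centroid by Euclidean distance.
        primary = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

        # Unit vector along the primary residual r = x - c1.
        r = x - centroids[primary]
        r_hat = r / (np.linalg.norm(r) + 1e-12)

        # Assumed SOAR score per candidate c:
        #   ||x - c||^2 + lam * <x - c, r_hat>^2
        resid = x - centroids                 # r' for every candidate center
        parallel = resid @ r_hat              # component of r' along r_hat
        scores = (resid ** 2).sum(axis=1) + lam * parallel ** 2
        scores[primary] = np.inf              # spill to a *different* center
        return primary, int(np.argmin(scores))

With λ = 0 this degenerates to plain two-way spilling; the λ = 1 setting from the Glove-1M row amplifies the penalty on redundant (parallel) secondary residuals.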
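
On the splits question specifically: the benchmark files carry a fixed, implicit split rather than one chosen by the authors. Assuming the standard ann-benchmarks HDF5 layout (keys 'train', 'test', 'neighbors', 'distances'), the split can be inspected directly; the file name below is the Glove angular dataset as hosted by ann-benchmarks.com.

    import h5py

    # Assumed ann-benchmarks layout: 'train' holds the base vectors to
    # index, 'test' holds the queries, and 'neighbors'/'distances' hold
    # the ground-truth answers for each query.
    with h5py.File("glove-100-angular.hdf5", "r") as f:
        base = f["train"][:]        # vectors to build the index over
        queries = f["test"][:]      # held-out query vectors
        gt = f["neighbors"][:]      # true nearest-neighbor ids per query
        print(base.shape, queries.shape, gt.shape)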
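
The experiment-setup parameters also map naturally onto ScaNN's public builder API. The sketch below mirrors the Glove-1M row (2000 partitions, dot-product distance, 2 dimensions per AH subspace) using the builder pattern from ScaNN's published examples; the num_leaves_to_search, training_sample_size, threshold, and reorder values are placeholders taken from those examples, and the SOAR spilling option itself is deliberately omitted because its released parameter name is not given here.

    import numpy as np
    import scann

    # Hypothetical input: float32 base vectors, unit-normalized so that
    # dot product matches the angular Glove-1M setup.
    dataset = np.load("glove_base.npy")  # placeholder file name
    dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

    searcher = (
        scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
        # 2000 partitions, matching "2000 partitions" above.
        .tree(num_leaves=2000, num_leaves_to_search=100,
              training_sample_size=250_000)
        # Anisotropic (AH) scoring with 2 dimensions per subspace; the
        # resulting subspace count follows from the data dimensionality,
        # so it will not be exactly the 16 quoted for the PQ config.
        .score_ah(2, anisotropic_quantization_threshold=0.2)
        .reorder(100)
        .build()
    )
    # SOAR spilling with lambda = 1 would be enabled in this builder
    # chain; the exact option name is not stated in the table, so this
    # sketch leaves it out rather than guess.

    neighbors, distances = searcher.search_batched(dataset[:5])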