Automating Nearest Neighbor Search Configuration with Constrained Optimization
Authors: Philip Sun, Ruiqi Guo, Sanjiv Kumar
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks. (See the first sketch after this table.) |
| Researcher Affiliation | Industry | Philip Sun, Ruiqi Guo & Sanjiv Kumar, Google Research, {sunphil,guorq,sanjivk}@google.com |
| Pseudocode | Yes | Algorithm 1 Quantization-Based ANN (see the second sketch after this table) |
| Open Source Code | No | The paper does not provide explicit statements or links to open-source code for the methodology described. |
| Open Datasets | Yes | We compare against the grid-searched parameters used by ScaNN (Guo et al., 2020) in its leading performance on the public Glove1M benchmark from Aumüller et al. (2020). Here we use our own implementation of a multi-level quantization ANN index, and benchmark on three datasets from big-ann-benchmarks.com (Simhadri et al., 2022), following the experimental setup stipulated by track 1 of the competition; see Appendix A.6 for more details. |
| Dataset Splits | Yes | We test generalization capability by randomly splitting the 10^4 queries in the DEEP1B dataset into two equal-sized halves Q1 and Q2. Then we compare the resulting Pareto frontiers of training on Q1 and testing on Q2 (out-of-sample), versus training and testing both on Q2 (in-sample). (See the third sketch after this table.) |
| Hardware Specification | Yes | Our query-time serving was performed on 16 physical cores from an Intel Cascade Lake-generation CPU, and peak memory usage was measured to be under 58GB. Azure VMs with 32 vCPUs and 64GB RAM. Azure VMs with 64 vCPUs, 128GB RAM, and 4TB SSD. Xeon W-2135. |
| Software Dependencies | No | The paper mentions software like Vizier and discusses other ANN algorithms but does not provide specific version numbers for any software dependencies used in their experiments. |
| Experiment Setup | Yes | The following benchmarks use a five-level quantization index (see Appendix A.6.1 for more details), resulting in a four-dimensional hyperparameter space. All three datasets used the same VQ and PQ settings: the datasets are first quantized to 4 × 10^6 centroids, and then again to 4 × 10^4 centroids. The PQ was performed with 16 centers (4 bits) per subspace, 4 dimensions per subspace at the X1 level, and 3 dimensions per subspace (rounding up) at the X3 level. Only the Yandex Text-to-Image dataset differs at the X5 level with its PQ setting (using 2 dimensions per subspace instead of 1). The search cost targets for the low, medium, and high settings were J(t) = 2 × 10^−6, 5 × 10^−6, and 1.2 × 10^−5, respectively. (This configuration is transcribed in the final sketch after this table.) |
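The framing quoted in the Research Type row, taking a desired search cost as input and returning a tuning near the speed-recall Pareto frontier, can be illustrated with a brute-force stand-in. This is a minimal sketch, not the paper's constrained-optimization method: given measured (cost, recall) pairs for candidate tunings, it extracts the empirical Pareto frontier and selects the max-recall tuning under a cost target. All function names here are illustrative.

```python
# A hedged sketch of Pareto-frontier selection over measured tunings;
# the paper replaces this exhaustive sweep with constrained optimization.
def pareto_frontier(points):
    """Return (cost, recall) pairs not dominated by any cheaper tuning."""
    frontier, best = [], -1.0
    for cost, recall in sorted(points, key=lambda p: (p[0], -p[1])):
        if recall > best:          # strictly improves on all cheaper tunings
            frontier.append((cost, recall))
            best = recall
    return frontier

def tune_for_cost(points, cost_target):
    """Max-recall tuning subject to cost <= cost_target."""
    feasible = [(recall, cost) for cost, recall in points if cost <= cost_target]
    if not feasible:
        raise ValueError("no tuning meets the cost target")
    recall, cost = max(feasible)
    return cost, recall

# Toy measurements; the cost targets mirror the J(t) values quoted above.
measurements = [(2e-6, 0.80), (5e-6, 0.92), (1.2e-5, 0.97), (6e-6, 0.90)]
print(pareto_frontier(measurements))
print(tune_for_cost(measurements, cost_target=5e-6))
```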
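The Pseudocode row cites the paper's Algorithm 1 (quantization-based ANN). Below is a minimal runnable sketch of a single-level version, assuming a vector-quantization partition stage followed by exact rerank; the paper's actual index stacks up to five quantization levels, and `t` (partitions probed) is the kind of hyperparameter its method tunes.

```python
# Single-level quantization-based ANN search: VQ partition, probe, rerank.
# Not the authors' implementation; requires SciPy >= 1.7 for the seed kwarg.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 32)).astype(np.float32)  # toy database
q = rng.standard_normal(32).astype(np.float32)            # toy query

# Index build: vector-quantize the database into 64 partitions.
centroids, assignments = kmeans2(X, 64, minit="points", seed=0)

def search(q, t=4, k=10):
    """Two-stage search: probe the t nearest partitions, rerank exactly."""
    # Stage 1: score all centroids, keep the t closest partitions.
    part_ids = np.argsort(((centroids - q) ** 2).sum(axis=1))[:t]
    # Stage 2: exact distances, restricted to points in those partitions.
    cand = np.flatnonzero(np.isin(assignments, part_ids))
    dists = ((X[cand] - q) ** 2).sum(axis=1)
    return cand[np.argsort(dists)[:k]]

print(search(q, t=4))  # larger t -> higher recall at higher search cost
```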
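The generalization check in the Dataset Splits row is a random half-split of the query set. A sketch, assuming queries arrive as a NumPy array; the tuning and evaluation steps are whatever pipeline is under test and are only outlined in comments.

```python
# Random equal-sized query split for the in-sample vs. out-of-sample check.
import numpy as np

def split_queries(queries, seed=0):
    """Randomly split queries into two equal-sized halves Q1 and Q2."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(queries))
    half = len(queries) // 2
    return queries[perm[:half]], queries[perm[half:]]

# q1, q2 = split_queries(all_queries)
# Out-of-sample: tune on q1, measure the Pareto frontier on q2.
# In-sample baseline: tune and measure both on q2.
```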
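Finally, the setup quoted in the Experiment Setup row can be transcribed into a configuration sketch. The field names are assumptions for illustration, not the authors' API; the numeric values are the ones stated in the paper.

```python
# Illustrative transcription of the quoted five-level index configuration.
INDEX_CONFIG = {
    "vq_centroids": [4_000_000, 40_000],   # first 4 x 10^6, then 4 x 10^4
    "pq": {
        "centers_per_subspace": 16,        # 4 bits per subspace
        "dims_per_subspace": {"X1": 4, "X3": 3},  # X3 rounds up
        # Yandex Text-to-Image alone uses 2 dims/subspace at X5 (vs. 1).
    },
    "search_cost_targets": {               # J(t) for low/medium/high
        "low": 2e-6,
        "medium": 5e-6,
        "high": 1.2e-5,
    },
}
```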