Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LeanVec: Searching vectors faster by making them fit
Authors: Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Theodore L. Willke
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art. ... We present in Section 3 extensive experimental results comparing LeanVec to its alternatives and showing its superiority across all relevant metrics. Diverse ablation studies show the impact of the different hyperparameters such as, for example, the target dimensionality d and the quantization level. |
| Researcher Affiliation | Industry | Mariano Tepper EMAIL Ishwar Singh Bhati EMAIL Cecilia Aguerrebere EMAIL Mark Hildebrand EMAIL Ted Willke EMAIL Intel Labs |
| Pseudocode | Yes | Algorithm 1: Frank-Wolfe BCD optimization for Problem (9) with factor α ∈ (0, 1). ... Algorithm 2: Eigenvector search optimization for Problem (15). |
| Open Source Code | No | For reproducibility, we will contribute the LeanVec implementation to Scalable Vector Search, an open source library for high-performance similarity search. ... We also introduce and will open-source two new datasets with different types of OOD characteristics. |
| Open Datasets | Yes | We also introduce and will open-source two new datasets with different types of OOD characteristics. ... For ID and OOD evaluations, we use standard and recently introduced datasets (Zhang et al., 2022; Babenko and Lempitsky, 2021; Schuhmann et al., 2021; Aguerrebere et al., 2024). ... On t2i-200-10M, the benchmark dataset for the NeurIPS'23 Big-ANN competition (Simhadri et al., 2024), we consider the track winner RoarANN (Chen et al., 2024). |
| Dataset Splits | Yes | We use separate learning and test query sets, each with 10K entries. ... To prevent overfitting, we use two separate query sets (see Appendix E): one to learn the LeanVec-OOD projection matrices and to calibrate the runtime search parameters in SVS, and one to generate our results. ... For all datasets, we use an in-distribution query set for training/calibration and a separate out-of-distribution query set for testing. Both sets have 10K queries each. |
| Hardware Specification | No | At 72 threads (our system has 36 physical cores and 72 threads), LeanVec provides an 8.5x performance gain over FP16 while consuming much less memory bandwidth (95 vs. 149 GB/s). The paper does not specify a CPU model or other detailed hardware specifications for reproducibility. |
| Software Dependencies | No | We integrated the proposed LeanVec into the state-of-the-art Scalable Vector Search (SVS) library (Aguerrebere et al., 2023) ... LeanVec-OOD learning (Section 2.2) is implemented in Python... The paper mentions software names but does not provide specific version numbers for any libraries or programming languages used in the current implementation. |
| Experiment Setup | Yes | Throughout the experiments, LeanVec uses LVQ8 for the primary vectors and FP16 for the secondary vectors. For each dataset, we use the dimensionality d that yields the highest search performance at 90% accuracy (see Table 1). For LeanVec-OOD, we present the results using Algorithm 1... In practice, we use early termination in Algorithm 1, i.e., we stop the iterations whenever \|f(A^(t+1), B^(t+1)) − f(A^(t), B^(t))\| / f(A^(t), B^(t)) ≤ 10⁻³. |
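The early-termination rule quoted in the Experiment Setup row can be sketched generically: iterate until the relative change in the objective f falls below 10⁻³. The sketch below is illustrative only, not the authors' implementation; the function names (`run_with_early_termination`, `step`) and the toy objective are hypothetical stand-ins for the paper's Frank-Wolfe BCD updates.

```python
# Illustrative sketch (assumed, not the authors' code) of the relative-change
# early-termination criterion: stop iterating once
# |f(state_new) - f(state_old)| / f(state_old) <= tol, with tol = 1e-3.

def run_with_early_termination(f, step, state, tol=1e-3, max_iters=1000):
    """Apply `step` repeatedly until the relative change in f drops below `tol`.

    Returns the final state and the number of iterations performed.
    """
    prev = f(state)
    for t in range(max_iters):
        state = step(state)       # one block-coordinate update (hypothetical)
        curr = f(state)
        if abs(curr - prev) / prev <= tol:
            return state, t + 1   # converged by the relative-change test
        prev = curr
    return state, max_iters       # hit the iteration budget

# Toy usage: "minimize" f(x) = x**2 + 1 by halving x each step.
final_state, iters = run_with_early_termination(
    lambda x: x * x + 1.0, lambda x: 0.5 * x, 8.0
)
```

A relative (rather than absolute) threshold makes the stopping test scale-invariant in f, which is why criteria of this form are common for iterative solvers like the Frank-Wolfe BCD procedure the paper describes.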