Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LeanVec: Searching vectors faster by making them fit

Authors: Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Theodore L. Willke

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art. ... We present in Section 3 extensive experimental results comparing LeanVec to its alternatives and showing its superiority across all relevant metrics. Diverse ablation studies show the impact of the different hyperparameters such as, for example, the target dimensionality d and the quantization level."
Researcher Affiliation | Industry | "Mariano Tepper EMAIL, Ishwar Singh Bhati EMAIL, Cecilia Aguerrebere EMAIL, Mark Hildebrand EMAIL, Ted Willke EMAIL, Intel Labs"
Pseudocode | Yes | "Algorithm 1: Frank-Wolfe BCD optimization for Problem (9) with factor ∈ (0, 1). ... Algorithm 2: Eigenvector search optimization for Problem (15)."
Open Source Code | No | "For reproducibility, we will contribute the LeanVec implementation to Scalable Vector Search, an open-source library for high-performance similarity search. ... We also introduce and will open-source two new datasets with different types of OOD characteristics."
Open Datasets | Yes | "We also introduce and will open-source two new datasets with different types of OOD characteristics. ... For ID and OOD evaluations, we use standard and recently introduced datasets (Zhang et al., 2022; Babenko and Lempitsky, 2021; Schuhmann et al., 2021; Aguerrebere et al., 2024). ... On t2i-200-10M, the benchmark dataset for the NeurIPS'23 Big-ANN competition (Simhadri et al., 2024), we consider the track winner RoarANN (Chen et al., 2024)."
Dataset Splits | Yes | "We use separate learning and test query sets, each with 10K entries. ... To prevent overfitting, we use two separate query sets (see Appendix E): one to learn the LeanVec-OOD projection matrices and to calibrate the runtime search parameters in SVS, and one to generate our results. ... For all datasets, we use an in-distribution query set for training/calibration and a separate out-of-distribution query set for testing. Both sets have 10K queries each."
Hardware Specification | No | "At 72 threads (our system has 36 physical cores and 72 threads), LeanVec provides an 8.5x performance gain over FP16 while consuming much less memory bandwidth (95 vs. 149 GB/s)." The paper does not specify a CPU model or other detailed hardware specifications for reproducibility.
Software Dependencies | No | "We integrated the proposed LeanVec into the state-of-the-art Scalable Vector Search (SVS) library (Aguerrebere et al., 2023) ... LeanVec-OOD learning (Section 2.2) is implemented in Python..." The paper mentions software names but does not provide specific version numbers for any libraries or programming languages used in the current implementation.
Experiment Setup | Yes | "Throughout the experiments, LeanVec uses LVQ8 for the primary vectors and FP16 for the secondary vectors. For each dataset, we use the dimensionality d that yields the highest search performance at 90% accuracy (see Table 1). For LeanVec-OOD, we present the results using Algorithm 1... In practice, we use early termination in Algorithm 1, i.e., we stop the iterations whenever |f(A^(t+1), B^(t+1)) − f(A^(t), B^(t))| / f(A^(t), B^(t)) ≤ 10^−3."
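The early-termination rule quoted above is a standard relative-change criterion: the optimization stops once the objective f changes by less than a factor of 10^−3 between consecutive iterates. A minimal sketch of such a stopping check follows; the objective `f` and update `step` are placeholders, not LeanVec's actual Frank-Wolfe BCD updates from Algorithm 1.

```python
def optimize(f, step, A, B, tol=1e-3, max_iters=1000):
    """Iterate `step` until the relative change in f drops below `tol`.

    f(A, B)    -> scalar objective value (assumed positive here)
    step(A, B) -> updated (A, B)

    Placeholder structure only: in the paper, the update is a
    Frank-Wolfe block-coordinate step on the projection matrices.
    """
    prev = f(A, B)
    for _ in range(max_iters):
        A, B = step(A, B)
        curr = f(A, B)
        # Early termination: |f_new - f_old| / f_old <= tol
        if abs(curr - prev) / prev <= tol:
            break
        prev = curr
    return A, B
```

For example, with a toy quadratic objective and a contraction step toward its minimizer, the loop converges in a handful of iterations once successive objective values agree to within 0.1%.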