Practical Near Neighbor Search via Group Testing
Authors: Joshua Engels, Benjamin Coleman, Anshumali Shrivastava
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory. |
| Researcher Affiliation | Collaboration | Joshua Engels Department of Computer Science Rice University Houston, Texas, USA jae4@rice.edu Benjamin Coleman* Electrical and Computer Engineering Rice University Houston, Texas, USA ben.coleman@rice.edu Anshumali Shrivastava Department of Computer Science Rice University & Third AI Corp. Houston, Texas, USA anshumali@rice.edu |
| Pseudocode | Yes | Algorithm 1 Index Construction, Algorithm 2 Index Query, Algorithm 3 Threshold Relaxation Algorithm |
| Open Source Code | Yes | 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include code, which is not yet released under any license. |
| Open Datasets | Yes | Datasets: We tested FLINNG on high-dimensional genomics, web-scale data mining, and embedding search datasets. We list the datasets in Table 2 and briefly describe them here. Ref Seq G and Ref Seq P are sets of reference genome and proteome sequences for approximately 88k species [17]. Prometh ION is a stream of raw metagenomic sequence reads from the latest sequencing machine by Oxford Nanopore [16]. The YFCC100M dataset consists of embeddings derived from neural network activations for 100M videos and images [21]. |
| Dataset Splits | Yes | 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials. |
| Hardware Specification | Yes | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We used two machines: one with 88 Intel Xeon E5-2699A v4 processors and one with 96 Intel Xeon Gold 5220R processors. There is more information in the supplementary materials. |
| Software Dependencies | No | We implemented FLINNG in C++, compiled with the highest level of optimization with GCC, and used Open MP to parallelize index construction. We use the hnswlib library and extended it to work on genomic datasets. FAISS is a highly optimized quantization-based library... |
| Experiment Setup | Yes | 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials. Due to space constraints, we provide information about hyperparameters, experiment setup, and computing hardware in the supplementary materials. |