Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Practical Near Neighbor Search via Group Testing

Authors: Joshua Engels, Benjamin Coleman, Anshumali Shrivastava

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we ๏ฌnd that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.
Researcher Affiliation Collaboration Joshua Engels Department of Computer Science Rice University Houston, Texas, USA EMAIL Benjamin Coleman* Electrical and Computer Engineering Rice University Houston, Texas, USA EMAIL Anshumali Shrivastava Department of Computer Science Rice University & Third AI Corp. Houston, Texas, USA EMAIL
Pseudocode Yes Algorithm 1 Index Construction, Algorithm 2 Index Query, Algorithm 3 Threshold Relaxation Algorithm
Open Source Code Yes 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include code, which is not yet released under any license.
Open Datasets Yes Datasets: We tested FLINNG on high-dimensional genomics, web-scale data mining, and embedding search datasets. We list the datasets in Table 2 and brie๏ฌ‚y describe them here. Ref Seq G and Ref Seq P are sets of reference genome and proteome sequences for approximately 88k species [17]. Prometh ION is a stream of raw metagenomic sequence reads from the latest sequencing machine by Oxford Nanopore [16]. The YFCC100M dataset consists of embeddings derived from neural network activations for 100M videos and images [21].
Dataset Splits Yes 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials.
Hardware Specification Yes 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We used two machines: one with 88 Intel Xeon E5-2699A v4 processors and one with 96 Intel Xeon Gold 5220R processors. There is more information in the supplementary materials.
Software Dependencies No We implemented FLINNG in C++, compiled with the highest level of optimization with GCC, and used Open MP to parallelize index construction. We use the hnswlib library and extended it to work on genomic datasets. FAISS is a highly optimized quantization-based library...
Experiment Setup Yes 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials. Due to space constraints, we provide information about hyperparameters, experiment setup, and computing hardware in the supplementary materials.