reproducibilityindex.ai

Practical Near Neighbor Search via Group Testing

Authors: Joshua Engels, Benjamin Coleman, Anshumali Shrivastava

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we ﬁnd that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.
Researcher Affiliation	Collaboration	Joshua Engels Department of Computer Science Rice University Houston, Texas, USA jae4@rice.edu Benjamin Coleman* Electrical and Computer Engineering Rice University Houston, Texas, USA ben.coleman@rice.edu Anshumali Shrivastava Department of Computer Science Rice University & Third AI Corp. Houston, Texas, USA anshumali@rice.edu
Pseudocode	Yes	Algorithm 1 Index Construction, Algorithm 2 Index Query, Algorithm 3 Threshold Relaxation Algorithm
Open Source Code	Yes	4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include code, which is not yet released under any license.
Open Datasets	Yes	Datasets: We tested FLINNG on high-dimensional genomics, web-scale data mining, and embedding search datasets. We list the datasets in Table 2 and brieﬂy describe them here. Ref Seq G and Ref Seq P are sets of reference genome and proteome sequences for approximately 88k species [17]. Prometh ION is a stream of raw metagenomic sequence reads from the latest sequencing machine by Oxford Nanopore [16]. The YFCC100M dataset consists of embeddings derived from neural network activations for 100M videos and images [21].
Dataset Splits	Yes	3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials.
Hardware Specification	Yes	3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We used two machines: one with 88 Intel Xeon E5-2699A v4 processors and one with 96 Intel Xeon Gold 5220R processors. There is more information in the supplementary materials.
Software Dependencies	No	We implemented FLINNG in C++, compiled with the highest level of optimization with GCC, and used Open MP to parallelize index construction. We use the hnswlib library and extended it to work on genomic datasets. FAISS is a highly optimized quantization-based library...
Experiment Setup	Yes	3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary materials. Due to space constraints, we provide information about hyperparameters, experiment setup, and computing hardware in the supplementary materials.