Near Neighbor: Who is the Fairest of Them All?

Authors: Sariel Har-Peled, Sepideh Mahabadi

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we run experiments to show performance of our approach on real data.
Researcher Affiliation | Academia | Sariel Har-Peled, University of Illinois at Urbana-Champaign, Champaign, IL 61801, sariel@illinois.edu; Sepideh Mahabadi, Toyota Technological Institute at Chicago, Chicago, IL 60637, mahabadi@ttic.edu
Pseudocode | No | The paper describes algorithms in text and refers to appendices for proofs, but does not include explicit pseudocode blocks or algorithm listings in the main body.
Open Source Code | No | The paper does not provide any specific links or explicit statements about releasing source code for the described methodology.
Open Datasets | Yes | We run our experiments on three datasets that are standard benchmarks in the context of Nearest Neighbor algorithms (see [ABF17]): (I) Our first data set contains a random subset of 10K points in the MNIST training data set [LBBH98]. ... (III) Finally, we take a random subset of 10K words from the GloVe data set [PSM14] and a random subset of 100 words as our query.
Dataset Splits | No | The paper mentions using a "MNIST training data set" and a "MNIST test data set" but does not specify a separate validation split or explicit percentages/counts for how the data is partitioned for training, validation, and testing.
Hardware Specification | No | The paper discusses the experiment setup and parameters but does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using the "locality sensitive hashing data structure for the L2 Euclidean distance [AI08]" and the "E2LSH library [And05]", but it does not specify version numbers for these software components.
Experiment Setup | Yes | For MNIST, the average distance of a query to its nearest neighbor in our data set is around 4.5. Thus we choose the near neighbor radius r = 5... we tune it [the bucket width w] between 3 and 5 and set its value to w = 3.1. We tune k and L so that the false negative rate (the near points that are not retrieved by LSH) is less than 10%, and moreover the cost of hashing (proportional to L) balances out the cost of scanning. We thus get k = 15 and L = 100.
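
To make the quoted setup concrete, the sketch below shows how r, w, k, and L interact in the standard p-stable LSH scheme for the L2 distance [AI08] that the E2LSH library implements: each of L tables keys a point by k concatenated hashes of the form h(v) = floor((a . v + b) / w), and a query scans its L buckets for points within radius r. This is an illustrative reconstruction under those assumptions, not the authors' code; the class name, function names, and toy data are invented stand-ins.

    # Minimal p-stable LSH sketch for L2 (assumed scheme, not the paper's code).
    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)

    class L2LSH:
        def __init__(self, dim, r=5.0, w=3.1, k=15, L=100):
            # Parameter values quoted from the paper's MNIST setup.
            self.r, self.w, self.k, self.L = r, w, k, L
            self.A = rng.normal(size=(L, k, dim))      # Gaussian projections
            self.B = rng.uniform(0.0, w, size=(L, k))  # random offsets in [0, w)
            self.tables = [defaultdict(list) for _ in range(L)]
            self.points = None

        def _keys(self, v):
            # h(v) = floor((a . v + b) / w), concatenated k times per table.
            proj = np.floor((self.A @ v + self.B) / self.w).astype(int)
            return [tuple(row) for row in proj]  # one key per table

        def build(self, points):
            self.points = points
            for i, p in enumerate(points):
                for t, key in enumerate(self._keys(p)):
                    self.tables[t][key].append(i)

        def near_neighbors(self, q):
            # Scan the query's L buckets; keep candidates within radius r.
            cand = set()
            for t, key in enumerate(self._keys(q)):
                cand.update(self.tables[t].get(key, ()))
            return [i for i in cand
                    if np.linalg.norm(self.points[i] - q) <= self.r]

    # Toy usage: random vectors standing in for the 10K-point MNIST subset.
    X = rng.normal(size=(1000, 784))
    index = L2LSH(dim=784)
    index.build(X)
    print(index.near_neighbors(X[0]))  # reports at least the point itself

Under this reading, the tuning the quote describes amounts to raising k until spurious bucket collisions are rare and raising L until the false negative rate drops below 10%, while keeping the hashing cost (proportional to L) comparable to the bucket-scanning cost.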