Practical and Optimal LSH for Angular Distance
Authors: Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt
NeurIPS 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also introduce a multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets. We complement these contributions with an experimental evaluation on both real and synthetic data (SIFT vectors, tf-idf data, and a random point set). Our results show that for data sets with around 10^5 to 10^8 points, our multiprobe variant of the cross-polytope LSH is up to 10× faster than an efficient implementation of the hyperplane LSH, and up to 700× faster than a linear scan. (Both hash families are sketched in code after this table.) |
| Researcher Affiliation | Academia | Alexandr Andoni (Columbia University); Piotr Indyk (MIT); Thijs Laarhoven (TU Eindhoven); Ilya Razenshteyn (MIT); Ludwig Schmidt (MIT) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. It discusses open-source aspects of LSH in general but does not link its own implementation. |
| Open Datasets | Yes | We evaluate the two hashing schemes on three types of data sets. We use a synthetic data set of randomly generated points because this allows us to vary a single problem parameter while keeping the remaining parameters constant. We also investigate the performance of our algorithm on real data: two tf-idf data sets [25] and a set of SIFT feature vectors [7]. [25] refers to UCI Machine Learning Repository, which is a public source. |
| Dataset Splits | No | The paper mentions evaluating performance on data sets and setting parameters, but it does not specify explicit train/validation/test splits, percentages, or methodology for reproduction of data partitioning beyond stating that it uses a 'random data set' and 'real data'. |
| Hardware Specification | No | The paper mentions 'CPU' in the context of experimental setup in Appendix D (which is not provided in the excerpt), but it does not specify any exact GPU/CPU models, processor types, or detailed computer specifications for the hardware used in experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names like Python 3.8, PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | In all experiments, we set the algorithm parameters so that the empirical probability of successfully finding the exact nearest neighbor is at least 0.9. Moreover, we set the number of LSH tables L so that the amount of additional memory occupied by the LSH data structure is comparable to the amount of memory necessary for storing the data set. For 2^20 points, the cross-polytope LSH is already 3.5× faster than the hyperplane LSH, and for n = 2^28 the speedup is 10.3× (see Table 3 in Appendix D). The number of hash tables L is set to 10. For the cross-polytope hash, we consider partial cross-polytopes in the last of the k hash functions in order to get a smooth trade-off between the various parameters (see Section 3.1). Table 1 lists the k values used, such as '2 (64)' for NYT and '6 (1)' for SIFT, indicating per-dataset parameter choices. |
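
For readers unfamiliar with the two hash families compared in the rows above, here is a minimal sketch in Python/NumPy. It is not the authors' implementation: the function names are illustrative, and it uses an exact QR-based random rotation where the paper uses faster pseudo-random Hadamard-based rotations.

```python
import numpy as np

def hyperplane_hash(x, planes):
    """Hyperplane LSH: one bit per random hyperplane, taken from the
    sign of the projection. `planes` is a (k, d) Gaussian matrix."""
    bits = (planes @ x) > 0
    return sum(int(b) << i for i, b in enumerate(bits))  # bucket id in [0, 2^k)

def random_rotation(d, rng):
    """Exact random rotation via QR decomposition. (Assumption: a stand-in
    for the paper's faster pseudo-random Hadamard rotations.)"""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def cross_polytope_hash(x, rotation):
    """Cross-polytope LSH: rotate x, then map it to the nearest of the
    2d signed standard basis vectors {+e_1, -e_1, ..., +e_d, -e_d}."""
    y = rotation @ x
    i = int(np.argmax(np.abs(y)))
    return i if y[i] > 0 else i + len(y)  # bucket id in [0, 2d)

rng = np.random.default_rng(0)
d, k = 128, 6
x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # angular distance: points live on the unit sphere

print(hyperplane_hash(x, rng.standard_normal((k, d))))
print(cross_polytope_hash(x, random_rotation(d, rng)))
```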
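
The setup row fixes the number of tables L and concatenates k hash functions per table. The following is a hedged sketch of that table structure, reusing the helpers above; it is single-probe only (the paper's multiprobe variant additionally visits nearby buckets), and the "partial" last cross-polytope of Section 3.1 is omitted.

```python
from collections import defaultdict

def build_tables(points, L, k, d, rng):
    """Build L hash tables; each table keys a point by the tuple of its
    k cross-polytope hashes under independent random rotations."""
    tables = []
    for _ in range(L):
        rotations = [random_rotation(d, rng) for _ in range(k)]
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(cross_polytope_hash(p, R) for R in rotations)
            buckets[key].append(idx)
        tables.append((rotations, buckets))
    return tables

def query(q, points, tables):
    """Single-probe query: collect candidates from the q-bucket of each
    of the L tables, then rank them by cosine similarity."""
    candidates = set()
    for rotations, buckets in tables:
        key = tuple(cross_polytope_hash(q, R) for R in rotations)
        candidates.update(buckets.get(key, ()))
    return max(candidates, key=lambda i: points[i] @ q, default=None)
```

Larger k yields finer buckets and fewer candidates per table, while more tables L raise the chance of finding the true nearest neighbor; this is the trade-off behind the 0.9 success probability and memory budget targeted in the quoted setup.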