Graph-based Nearest Neighbor Search in Hyperbolic Spaces

Authors: Liudmila Prokhorenkova, Dmitry Baranchuk, Nikolay Bogachev, Yury Demidovich, Alexander Kolpakov

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This agrees with our experiments: we consider datasets embedded in hyperbolic and Euclidean spaces and show that graph-based NNS can be more efficient in the hyperbolic space. We also demonstrate that graph-based methods outperform other existing baselines for hyperbolic NNS.
Researcher Affiliation | Collaboration | Liudmila Prokhorenkova, Yandex Research, Moscow, Russia (ostroumova-la@yandex.ru); Dmitry Baranchuk, Yandex Research, Moscow, Russia (dbaranchuk@yandex-team.ru); Nikolay Bogachev, IITP RAS and MIPT, Moscow, Russia (nvbogach@mail.ru); Yury Demidovich, MIPT, Moscow, Russia (demidovich.yua@phystech.edu); Alexander Kolpakov, Université de Neuchâtel, Neuchâtel, Switzerland (kolpakov.alexander@gmail.com)
Pseudocode | No | The paper describes algorithms in paragraph text but does not include any formal pseudocode or algorithm blocks labeled as such.
Open Source Code | Yes | Additional experiments can be found in Appendix; the code supplements the submission.
Open Datasets | Yes | For this, we collect the largest publicly available Poincaré GloVe vocabulary of 189,533 unique tokens. Poincaré GloVe was shown to outperform the Euclidean representation for tasks of similarity, analogy, and hypernymy detection (Tifrea et al., 2019). ... We also compare the efficiency of graph-based methods for Euclidean (Pennington et al., 2014) and Poincaré (Tifrea et al., 2019) GloVe embeddings of the same tokens.
Dataset Splits | Yes | The set of tokens is randomly split into the base set of 180K elements and the query set of the remaining elements for evaluation.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions software like HNSW, NSG, and LSH but does not provide specific version numbers for these or other software components/libraries used in the experiments.
Experiment Setup | Yes | For this, we follow the setup of Prokhorenkova & Shekhovtsov (2020) and perform a synthetic experiment. We generate datasets uniformly distributed in different spaces: 1) on a sphere (Prokhorenkova & Shekhovtsov, 2020), 2) within a Euclidean ball, 3) within a ball of radius R in the hyperbolic space of curvature −1. For all spaces, we fix n = 10^6 and vary the dimension d ∈ {2, 4, 6}. ... For both GloVe and Poincaré GloVe databases, we get M=10 and efConstruction=300. For other datasets, we get M=8. ... As in the original paper, we split the elements into 25 bands such that w^(i−1) ≤ 1/(1 − ||x||²) ≤ w^i. To achieve this, we take w = 1.05. For each band, we construct LSH with 50 tables. Each table has [log(m)] − 1 bins, where m is the number of elements in the band. The number of probes is tuned for each band to reach 0.95 recall (this value has been chosen to achieve better results). When searching for the nearest neighbor for a query q, we compute the nearest neighbors within the l bands closest to q and take the best one according to the hyperbolic distance. The number of bands l is varied from 1 to 10, thus giving the curve in Figure 2. (Illustrative sketches of the hyperbolic-ball sampling and of the band assignment follow the table.)
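
The third synthetic dataset above requires points drawn uniformly, with respect to hyperbolic volume, from a ball of radius R in the hyperbolic space of curvature −1. Below is a minimal sketch of one standard way to do this, assuming the Poincaré-ball model and rejection sampling of the radial coordinate; the function name, the value of R, and the sampler itself are illustrative and not taken from the paper's code.

```python
import numpy as np

def sample_hyperbolic_ball(n, d, R, seed=None):
    """Sample n points uniformly (w.r.t. hyperbolic volume) from a ball of
    hyperbolic radius R in d-dimensional hyperbolic space of curvature -1,
    returned in Poincare-ball coordinates."""
    rng = np.random.default_rng(seed)
    # Uniform directions on the unit sphere S^{d-1}.
    dirs = rng.standard_normal((n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # The radial density is proportional to sinh(r)^(d-1) on [0, R];
    # draw radii by rejection sampling against a uniform proposal on [0, R].
    radii = np.empty(n)
    filled = 0
    while filled < n:
        r = rng.uniform(0.0, R, size=n - filled)
        accept = rng.uniform(size=r.size) < (np.sinh(r) / np.sinh(R)) ** (d - 1)
        kept = r[accept]
        radii[filled:filled + kept.size] = kept
        filled += kept.size
    # In the Poincare ball, a point at hyperbolic distance r from the origin
    # has Euclidean norm tanh(r / 2).
    return dirs * np.tanh(radii / 2.0)[:, None]

# For example, n = 10^6 points in dimension d = 4 (R = 6.0 is illustrative).
points = sample_hyperbolic_ball(10**6, 4, R=6.0, seed=0)
```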
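
The LSH baseline quoted in the Experiment Setup row assigns each point x to a band by the value 1/(1 − ||x||²) and re-ranks retrieved candidates by the exact hyperbolic distance. The sketch below illustrates those two ingredients; the closed-form Poincaré distance is standard, while the band-probing order (nearest by band index) and the `lsh_lookup` callable are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def poincare_distance(x, y):
    """Hyperbolic (curvature -1) distance between points inside the unit Poincare ball."""
    x, y = np.asarray(x), np.asarray(y)
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def band_index(x, w=1.05, num_bands=25):
    """Band i (1..num_bands) such that w^(i-1) <= 1 / (1 - ||x||^2) <= w^i."""
    t = 1.0 / (1.0 - np.sum(np.asarray(x) ** 2))
    i = int(np.ceil(np.log(t) / np.log(w)))
    return int(np.clip(i, 1, num_bands))

def search_nearest(query, l, lsh_lookup, num_bands=25):
    """Probe the l bands closest to the query's band (here: closest by band
    index, an assumption) and re-rank candidates by exact hyperbolic distance.
    `lsh_lookup(band, query)` is a hypothetical per-band Euclidean LSH lookup."""
    q_band = band_index(query, num_bands=num_bands)
    order = sorted(range(1, num_bands + 1), key=lambda b: abs(b - q_band))
    candidates = [c for b in order[:l] for c in lsh_lookup(b, query)]
    return min(candidates, key=lambda c: poincare_distance(query, c), default=None)
```

Within a single band the factor 1/(1 − ||x||²) varies by at most a factor of w, which is what makes a per-band Euclidean LSH a reasonable candidate generator before the final re-ranking by the exact hyperbolic distance.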