Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond

Authors: Lin Chen, Hossein Esfandiari, Gang Fu, Vahab Mirrokni

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the first part, we verify the theoretical bounds derived in Theorem 1 on real data. We used latent Dirichlet allocation to extract the topic distributions of Reuters-21578, Distribution 1.0. The number of topics is set to 10. We sampled 100 documents uniformly at random and computed the GJS divergence and Hellinger distance between each pair of topic distributions. Each dot in Fig. 1 represents the topic distribution of a document. The horizontal axis denotes the Hellinger distance while the vertical axis denotes the GJS divergence. We chose different parameter values (λ = 1/2, 1/3, 1/10) for the GJS divergence. From the three subfigures, we observe that both the upper and lower bounds are tight for the data. In the second part, we apply the proposed LSH scheme for the GJS divergence to the nearest neighbor search problem in Fashion MNIST [39], MNIST [23], and CIFAR-10 [21]. Each image in the datasets is flattened into a vector and L1-normalized, thereby summing to 1. As described in Section 2.1, a concatenation of hash functions is used. We denote the number of concatenated hash functions by K and the number of compound hash functions by L. In the first set of experiments, we set K = 3 and vary L from 20 to 40. We measure the execution time of LSH-based k-nearest neighbor search and the exact (brute-force) algorithm, where k is set to 20.
Researcher Affiliation | Collaboration | Lin Chen (1,2), Hossein Esfandiari (2), Thomas Fu (2), Vahab S. Mirrokni (2); 1: Yale University, 2: Google Research; lin.chen@yale.edu, {esfandiari,thomasfu,mirrokni}@google.com
Pseudocode | Yes | Algorithm 1 Kreĭn-LSH. Input: Discretization parameters J ∈ ℕ and ε > 0. Output: The left and right Kreĭn transforms η1 and η2.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it state that the code is open-source or provide a link.
Open Datasets | Yes | We used latent Dirichlet allocation to extract the topic distributions of Reuters-21578, Distribution 1.0. ... we apply the proposed LSH scheme for the GJS divergence to the nearest neighbor search problem in Fashion MNIST [39], MNIST [23], and CIFAR-10 [21].
Dataset Splits | No | The paper mentions using datasets for experiments (Reuters-21578, Fashion MNIST, MNIST, CIFAR-10) but does not provide specific details on training, validation, or test data splits (e.g., percentages, sample counts, or explicit splitting methodologies).
Hardware Specification | Yes | Both algorithms were run on a 2.2 GHz Intel Core i7 processor.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | In the first set of experiments, we set K = 3 and vary L from 20 to 40. We measure the execution time of LSH-based k-nearest neighbor search and the exact (brute-force) algorithm, where k is set to 20. ... We also vary the parameter of the GJS divergence and choose λ from {1/2, 1/3, 1/10}. ... In the second set of experiments, we fix the parameter of the GJS divergence to 1/2; i.e., the JS divergence is used. The number of concatenated hash functions K ranges from 3 to 5 or 4 to 6.
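The evidence above compares the GJS divergence against the Hellinger distance between pairs of topic distributions. A minimal sketch of both quantities for discrete distributions, assuming the standard skewed form GJS_λ(P, Q) = λ·KL(P‖M) + (1−λ)·KL(Q‖M) with M = λP + (1−λ)Q (λ = 1/2 recovers the ordinary JS divergence); the function names are illustrative, not taken from the paper:

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence; terms with p_i = 0 contribute 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gjs(p, q, lam=0.5):
    # Generalized Jensen-Shannon divergence with skew parameter lam
    # (assumed definition; lam = 1/2 gives the JS divergence).
    m = lam * p + (1.0 - lam) * q
    return lam * kl(p, m) + (1.0 - lam) * kl(q, m)

def hellinger(p, q):
    # Hellinger distance between discrete distributions, in [0, 1].
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Example: the three skew values used in the paper's first experiment.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
for lam in (1 / 2, 1 / 3, 1 / 10):
    print(lam, gjs(p, q, lam), hellinger(p, q))
```

With λ = 1/2 the value is bounded above by ln 2, which is why the divergence and the (bounded) Hellinger distance can be meaningfully plotted against each other as in Fig. 1.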
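The setup rows describe K concatenated hash bits per table (AND) and L compound hash tables (OR) for LSH-based k-NN search. The skeleton below illustrates only that AND/OR structure, using a generic sign-of-random-projection hash on √p as a stand-in hash family related to Hellinger geometry; it is not the paper's GJS/Kreĭn scheme, and all names are illustrative. Candidates are re-ranked by exact Hellinger distance:

```python
import numpy as np

def build_tables(X, K=3, L=20, seed=0):
    # X: (n, d) array whose rows are L1-normalized distributions.
    # K random hyperplanes per table (concatenation / AND),
    # L independent tables (OR), hashing the sqrt-probability vectors.
    rng = np.random.default_rng(seed)
    S = np.sqrt(X)
    planes = rng.standard_normal((L, K, X.shape[1]))
    tables = []
    for l in range(L):
        keys = S @ planes[l].T > 0          # (n, K) sign bits
        table = {}
        for i in range(X.shape[0]):
            table.setdefault(keys[i].tobytes(), []).append(i)
        tables.append(table)
    return planes, tables

def query(q, X, planes, tables, k=20):
    # Collect the union of the query's L buckets, then re-rank the
    # candidates by exact Hellinger distance and return the top k.
    s = np.sqrt(q)
    cand = set()
    for l, table in enumerate(tables):
        key = (planes[l] @ s > 0).tobytes()
        cand.update(table.get(key, []))
    cand = np.array(sorted(cand), dtype=int)
    if cand.size == 0:
        return cand
    d = np.sqrt(0.5 * np.sum((np.sqrt(X[cand]) - s) ** 2, axis=1))
    return cand[np.argsort(d)[:k]]
```

Larger K makes buckets smaller (fewer, higher-precision candidates); larger L raises recall at the cost of more tables, which matches the paper's sweep of K from 3 to 6 and L from 20 to 40.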