Simple and Efficient Weighted Minwise Hashing

Authors: Anshumali Shrivastava

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than the Ioffe s method without losing any accuracy.
Researcher Affiliation Academia Anshumali Shrivastava Department of Computer Science Rice University Houston, TX, 77005 anshumali@rice.edu
Pseudocode Yes Algorithm 1 Ioffe s CWS
Open Source Code No The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available.
Open Datasets Yes We chose two popular publicly available vision dataset Caltech101 [9] and Oxford [19].
Dataset Splits No The paper discusses evaluating estimation accuracy by varying the number of hashes (k) and averaging errors over repetitions, but does not specify a distinct 'validation' dataset split (e.g., in percentages or counts) or a k-fold cross-validation setup for model selection or hyperparameter tuning.
Hardware Specification Yes Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM.
Software Dependencies No Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM. No specific version numbers for C# or other software dependencies are provided.
Experiment Setup Yes For this task we take 9 pairs of vectors from our datasets with varying level of similarities. For each of the pair (x, y), we generate k weighted minwise hashes hi(x) and hi(y) for i {1, 2, .., k}, using the three competing schemes. We then compute the estimate of the Jaccard similarity J(x, y) using the formula 1 k Pk i=1 1{hi(x) = hi(y)} (See Equation 7). We compute the errors in the estimate as a function of k. To minimize the effect of randomization, we average the errors from 200 random repetitions with different seeds. We plot this average error with k = {1, 2, ..., 50} in Figure 2 for different similarity levels.