Simple and Efficient Weighted Minwise Hashing
Authors: Anshumali Shrivastava
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe's method without losing any accuracy. |
| Researcher Affiliation | Academia | Anshumali Shrivastava Department of Computer Science Rice University Houston, TX, 77005 anshumali@rice.edu |
| Pseudocode | Yes | Algorithm 1: Ioffe's CWS |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | We chose two popular publicly available vision datasets, Caltech101 [9] and Oxford [19]. |
| Dataset Splits | No | The paper discusses evaluating estimation accuracy by varying the number of hashes (k) and averaging errors over repetitions, but does not specify a distinct 'validation' dataset split (e.g., in percentages or counts) or a k-fold cross-validation setup for model selection or hyperparameter tuning. |
| Hardware Specification | Yes | Our experiments were coded in C# on an Intel Xeon CPU with 256 GB RAM. |
| Software Dependencies | No | Our experiments were coded in C# on an Intel Xeon CPU with 256 GB RAM. No specific version numbers for C# or other software dependencies are provided. |
| Experiment Setup | Yes | For this task we take 9 pairs of vectors from our datasets with varying levels of similarity. For each pair (x, y), we generate k weighted minwise hashes h_i(x) and h_i(y) for i ∈ {1, 2, ..., k}, using the three competing schemes. We then compute the estimate of the Jaccard similarity J(x, y) using the formula $\frac{1}{k}\sum_{i=1}^{k} \mathbf{1}\{h_i(x) = h_i(y)\}$ (see Equation 7). We compute the errors in the estimate as a function of k. To minimize the effect of randomization, we average the errors from 200 random repetitions with different seeds. We plot this average error for k = {1, 2, ..., 50} in Figure 2 for different similarity levels. |
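
The baseline scheme and the estimator quoted in the setup row can be sketched in a few lines. Below is a minimal NumPy implementation of Ioffe's consistent weighted sampling (the paper's Algorithm 1) together with the collision-rate estimator $\frac{1}{k}\sum_{i=1}^{k} \mathbf{1}\{h_i(x) = h_i(y)\}$ from Equation 7. This is an illustrative sketch, not the paper's C# code; function names and the use of a per-hash seed for shared randomness are our assumptions.

```python
import numpy as np

def ioffe_cws(x, seed):
    """One consistent weighted sample of a non-negative vector x
    (Ioffe's CWS, as summarized in the paper's Algorithm 1)."""
    rng = np.random.default_rng(seed)  # shared randomness across vectors
    d = len(x)
    r = rng.gamma(2.0, 1.0, d)         # r_i ~ Gamma(2, 1)
    c = rng.gamma(2.0, 1.0, d)         # c_i ~ Gamma(2, 1)
    beta = rng.uniform(0.0, 1.0, d)    # beta_i ~ Uniform(0, 1)
    idx = np.nonzero(x > 0)[0]         # only active (positive-weight) dims
    t = np.floor(np.log(x[idx]) / r[idx] + beta[idx])
    y = np.exp(r[idx] * (t - beta[idx]))
    a = c[idx] / (y * np.exp(r[idx]))
    j = int(np.argmin(a))
    return (int(idx[j]), float(t[j]))  # hash value is the pair (i*, t_i*)

def estimate_jaccard(x, y, k):
    """Estimate weighted Jaccard as the fraction of colliding hashes:
    (1/k) * sum_{i=1}^{k} 1{h_i(x) == h_i(y)}  (Equation 7)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    hits = sum(ioffe_cws(x, s) == ioffe_cws(y, s) for s in range(k))
    return hits / k
```

For x = (1, 1, 0) and y = (1, 0, 1), the true weighted Jaccard is min-sum over max-sum = 1/3, and the estimate concentrates around that value as k grows, mirroring the accuracy-vs-k curves the paper reports in Figure 2.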