Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Simple and Efficient Weighted Minwise Hashing
Authors: Anshumali Shrivastava
NeurIPS 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than the Ioffe s method without losing any accuracy. |
| Researcher Affiliation | Academia | Anshumali Shrivastava Department of Computer Science Rice University Houston, TX, 77005 EMAIL |
| Pseudocode | Yes | Algorithm 1 Ioffe s CWS |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | We chose two popular publicly available vision dataset Caltech101 [9] and Oxford [19]. |
| Dataset Splits | No | The paper discusses evaluating estimation accuracy by varying the number of hashes (k) and averaging errors over repetitions, but does not specify a distinct 'validation' dataset split (e.g., in percentages or counts) or a k-fold cross-validation setup for model selection or hyperparameter tuning. |
| Hardware Specification | Yes | Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM. |
| Software Dependencies | No | Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM. No specific version numbers for C# or other software dependencies are provided. |
| Experiment Setup | Yes | For this task we take 9 pairs of vectors from our datasets with varying level of similarities. For each of the pair (x, y), we generate k weighted minwise hashes hi(x) and hi(y) for i {1, 2, .., k}, using the three competing schemes. We then compute the estimate of the Jaccard similarity J(x, y) using the formula 1 k Pk i=1 1{hi(x) = hi(y)} (See Equation 7). We compute the errors in the estimate as a function of k. To minimize the effect of randomization, we average the errors from 200 random repetitions with different seeds. We plot this average error with k = {1, 2, ..., 50} in Figure 2 for different similarity levels. |