reproducibilityindex.ai

Simple and Efficient Weighted Minwise Hashing

Authors: Anshumali Shrivastava

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than the Ioffe s method without losing any accuracy.
Researcher Affiliation	Academia	Anshumali Shrivastava Department of Computer Science Rice University Houston, TX, 77005 anshumali@rice.edu
Pseudocode	Yes	Algorithm 1 Ioffe s CWS
Open Source Code	No	The paper does not provide any explicit statement or link indicating that the source code for their methodology is publicly available.
Open Datasets	Yes	We chose two popular publicly available vision dataset Caltech101 [9] and Oxford [19].
Dataset Splits	No	The paper discusses evaluating estimation accuracy by varying the number of hashes (k) and averaging errors over repetitions, but does not specify a distinct 'validation' dataset split (e.g., in percentages or counts) or a k-fold cross-validation setup for model selection or hyperparameter tuning.
Hardware Specification	Yes	Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM.
Software Dependencies	No	Our experiments were coded in C# on Intel Xenon CPU with 256 GB RAM. No specific version numbers for C# or other software dependencies are provided.
Experiment Setup	Yes	For this task we take 9 pairs of vectors from our datasets with varying level of similarities. For each of the pair (x, y), we generate k weighted minwise hashes hi(x) and hi(y) for i {1, 2, .., k}, using the three competing schemes. We then compute the estimate of the Jaccard similarity J(x, y) using the formula 1 k Pk i=1 1{hi(x) = hi(y)} (See Equation 7). We compute the errors in the estimate as a function of k. To minimize the effect of randomization, we average the errors from 200 random repetitions with different seeds. We plot this average error with k = {1, 2, ..., 50} in Figure 2 for different similarity levels.