reproducibilityindex.ai

C-MinHash: Improving Minwise Hashing with Circulant Permutation

Authors: Xiaoyun Li, Ping Li

ICML 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments are conducted to show the effectiveness of the proposed method.
Researcher Affiliation	Industry	Xiaoyun Li, Ping Li; Cognitive Computing Lab, Baidu Research, 10900 NE 8th St., Bellevue, WA 98004, USA; {lixiaoyun996, pingli98}@gmail.com
Pseudocode	Yes	Algorithm 1 Minwise-hashing (MinHash). Input: Binary data vector v ∈ {0, 1}^D; K independent permutations π_1, ..., π_K: [D] → [D]. Output: K hash values h_1(v), ..., h_K(v). For k = 1 to K: h_k(v) ← min_{i: v_i ≠ 0} π_k(i).
Open Source Code	No	No explicit statement or link for open-source code release was found.
Open Datasets	Yes	We test C-MinHash on four public datasets, including two text datasets: the NIPS full paper dataset from UCI repository (Dua and Graff, 2017), the BBC News dataset (Greene and Cunningham, 2006), and two popular image datasets: the MNIST dataset (LeCun et al., 1998) with hand-written digits, and the CIFAR dataset (Krizhevsky, 2009) containing natural images.
Dataset Splits	No	The paper uses public datasets but does not explicitly describe train/validation/test splits, percentages, or the methodology for splitting.
Hardware Specification	No	No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for the experiments are mentioned in the paper.
Software Dependencies	No	The paper does not specify any software dependencies with version numbers.
Experiment Setup	Yes	All the datasets are processed to be binary. For image data, we first transform the images to gray-scale, then binarize the samples by thresholding at 0.5. For each dataset with n data vectors, there are in total n(n − 1)/2 data vector pairs. We estimate the Jaccard similarities for all the pairs and report the mean absolute errors (MAE). All the results are averaged over 10 independent repetitions.
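
The quoted Algorithm 1 and the evaluation protocol in the Experiment Setup row can be illustrated with a minimal Python sketch. It implements classical MinHash with K independent permutations (the baseline in Algorithm 1), not the circulant C-MinHash variant the paper proposes. All function names, the toy data, the strict `>` comparison at the 0.5 threshold, and the seeds are assumptions made for illustration; none of them come from the paper, which releases no code.

```python
import random
from itertools import combinations


def binarize(x, threshold=0.5):
    """Threshold real values into {0, 1}; strict '>' is an assumption."""
    return [1 if val > threshold else 0 for val in x]


def minhash(v, perms):
    """Algorithm 1: h_k(v) = min over nonzero coordinates i of pi_k(i)."""
    nonzero = [i for i, x in enumerate(v) if x != 0]
    if not nonzero:
        raise ValueError("MinHash is undefined for an all-zero vector")
    return [min(perm[i] for i in nonzero) for perm in perms]


def jaccard(v, w):
    """Exact Jaccard similarity of two binary vectors (not both all-zero)."""
    inter = sum(1 for a, b in zip(v, w) if a and b)
    union = sum(1 for a, b in zip(v, w) if a or b)
    return inter / union


def estimate_jaccard(hv, hw):
    """Fraction of the K hash positions on which the two sketches collide."""
    return sum(a == b for a, b in zip(hv, hw)) / len(hv)


def mean_absolute_error(data, K, rng):
    """MAE of the MinHash estimate over all n(n - 1)/2 vector pairs."""
    D = len(data[0])
    # K independent uniform random permutations of {0, ..., D - 1}.
    perms = [rng.sample(range(D), D) for _ in range(K)]
    hashes = [minhash(v, perms) for v in data]
    errors = [
        abs(estimate_jaccard(hashes[i], hashes[j]) - jaccard(data[i], data[j]))
        for i, j in combinations(range(len(data)), 2)
    ]
    return sum(errors) / len(errors)


# Hypothetical toy data: 5 random vectors of dimension 32, then binarized.
toy_rng = random.Random(0)
raw = [[toy_rng.random() for _ in range(32)] for _ in range(5)]
data = [binarize(x) for x in raw]

# Average the MAE over 10 independent repetitions, as in the quoted setup.
runs = [mean_absolute_error(data, K=64, rng=random.Random(seed)) for seed in range(10)]
print(sum(runs) / len(runs))
```

Each repetition draws fresh permutations, so averaging over the 10 runs reduces the noise from any single draw. Swapping in the paper's circulant scheme would change only how `perms` is generated.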