C-MinHash: Improving Minwise Hashing with Circulant Permutation

Authors: Xiaoyun Li, Ping Li

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments are conducted to show the effectiveness of the proposed method.
Researcher Affiliation Industry Xiaoyun Li, Ping Li Cognitive Computing Lab Baidu Research 10900 NE 8th St. Bellevue, WA 98004, USA {lixiaoyun996, pingli98}@gmail.com
Pseudocode Yes Algorithm 1 Minwise-hashing (Min Hash) Input: Binary data vector v {0, 1}D; K independent permutations π1, ..., πK: [D] [D] Output: K hash values h1(v), ..., h K(v) For k = 1 to K hk(v) mini:vi =0 πk(i)
Open Source Code No No explicit statement or link for open-source code release was found.
Open Datasets Yes We test C-Min Hash on four public datasets, including two text datasets: the NIPS full paper dataset from UCI repository (Dua and Graff, 2017), the BBC News dataset (Greene and Cunningham, 2006), and two popular image datasets: the MNIST dataset (Le Cun et al., 1998) with hand-written digits, and the CIFAR dataset (Krizhevsky, 2009) containing natural images.
Dataset Splits No The paper uses public datasets but does not explicitly describe train/validation/test splits, percentages, or the methodology for splitting.
Hardware Specification No No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for the experiments are mentioned in the paper.
Software Dependencies No The paper does not specify any software dependencies with version numbers.
Experiment Setup Yes All the datasets are processed to be binary. For image data, we first transform the images to gray-scale, then binarize the samples by thresholding at 0.5. For each dataset with n data vectors, there are in total n(n 1)/2 data vector pairs. We estimate the Jaccard similarities for all the pairs and report the mean absolute errors (MAE). All the results are averaged over 10 independent repetitions.