C-MinHash: Improving Minwise Hashing with Circulant Permutation
Authors: Xiaoyun Li, Ping Li
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted to show the effectiveness of the proposed method. |
| Researcher Affiliation | Industry | Xiaoyun Li, Ping Li Cognitive Computing Lab Baidu Research 10900 NE 8th St. Bellevue, WA 98004, USA {lixiaoyun996, pingli98}@gmail.com |
| Pseudocode | Yes | Algorithm 1 Minwise-hashing (Min Hash) Input: Binary data vector v {0, 1}D; K independent permutations π1, ..., πK: [D] [D] Output: K hash values h1(v), ..., h K(v) For k = 1 to K hk(v) mini:vi =0 πk(i) |
| Open Source Code | No | No explicit statement or link for open-source code release was found. |
| Open Datasets | Yes | We test C-Min Hash on four public datasets, including two text datasets: the NIPS full paper dataset from UCI repository (Dua and Graff, 2017), the BBC News dataset (Greene and Cunningham, 2006), and two popular image datasets: the MNIST dataset (Le Cun et al., 1998) with hand-written digits, and the CIFAR dataset (Krizhevsky, 2009) containing natural images. |
| Dataset Splits | No | The paper uses public datasets but does not explicitly describe train/validation/test splits, percentages, or the methodology for splitting. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | All the datasets are processed to be binary. For image data, we first transform the images to gray-scale, then binarize the samples by thresholding at 0.5. For each dataset with n data vectors, there are in total n(n 1)/2 data vector pairs. We estimate the Jaccard similarities for all the pairs and report the mean absolute errors (MAE). All the results are averaged over 10 independent repetitions. |