Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
C-MinHash: Improving Minwise Hashing with Circulant Permutation
Authors: Xiaoyun Li, Ping Li
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted to show the effectiveness of the proposed method. |
| Researcher Affiliation | Industry | Xiaoyun Li, Ping Li Cognitive Computing Lab Baidu Research 10900 NE 8th St. Bellevue, WA 98004, USA EMAIL |
| Pseudocode | Yes | Algorithm 1 Minwise-hashing (MinHash). Input: Binary data vector $v \in \{0, 1\}^D$; $K$ independent permutations $\pi_1, \ldots, \pi_K$: $[D] \to [D]$. Output: $K$ hash values $h_1(v), \ldots, h_K(v)$. For $k = 1$ to $K$: $h_k(v) \leftarrow \min_{i: v_i \neq 0} \pi_k(i)$. |
| Open Source Code | No | No explicit statement or link for open-source code release was found. |
| Open Datasets | Yes | We test C-MinHash on four public datasets, including two text datasets: the NIPS full paper dataset from UCI repository (Dua and Graff, 2017), the BBC News dataset (Greene and Cunningham, 2006), and two popular image datasets: the MNIST dataset (LeCun et al., 1998) with hand-written digits, and the CIFAR dataset (Krizhevsky, 2009) containing natural images. |
| Dataset Splits | No | The paper uses public datasets but does not explicitly describe train/validation/test splits, percentages, or the methodology for splitting. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | All the datasets are processed to be binary. For image data, we first transform the images to gray-scale, then binarize the samples by thresholding at 0.5. For each dataset with $n$ data vectors, there are in total $n(n-1)/2$ data vector pairs. We estimate the Jaccard similarities for all the pairs and report the mean absolute errors (MAE). All the results are averaged over 10 independent repetitions. |
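
The Pseudocode and Experiment Setup rows above can be illustrated with a minimal sketch. It implements only the classical MinHash baseline of Algorithm 1 (K independent permutations), not the circulant C-MinHash variant the paper proposes, and evaluates it with the MAE over all n(n-1)/2 pairs. The function names, the strict `> 0.5` threshold, the toy random data, and the choice of K = 128 are assumptions made for illustration; they are not taken from the paper.

```python
import random
from itertools import combinations


def binarize(values, threshold=0.5):
    """Binarize gray-scale values; the strict '>' comparison is an assumption."""
    return [1 if x > threshold else 0 for x in values]


def minhash(v, perms):
    """Algorithm 1: h_k(v) = min over nonzero coordinates i of perm_k[i]."""
    nonzero = [i for i, bit in enumerate(v) if bit]
    if not nonzero:
        raise ValueError("MinHash is undefined for an all-zero vector")
    return [min(perm[i] for i in nonzero) for perm in perms]


def jaccard(v, w):
    """Exact Jaccard similarity of two binary vectors."""
    inter = sum(1 for a, b in zip(v, w) if a and b)
    union = sum(1 for a, b in zip(v, w) if a or b)
    return inter / union


def estimate_jaccard(hv, hw):
    """Fraction of colliding hash values, the usual MinHash estimator."""
    return sum(a == b for a, b in zip(hv, hw)) / len(hv)


def mae_over_pairs(data, K, rng):
    """Mean absolute error of the estimate over all n(n-1)/2 vector pairs."""
    D = len(data[0])
    # K independent uniform permutations of [D] (classical MinHash).
    perms = [rng.sample(range(D), D) for _ in range(K)]
    hashes = [minhash(v, perms) for v in data]
    errors = [
        abs(estimate_jaccard(hashes[a], hashes[b]) - jaccard(data[a], data[b]))
        for a, b in combinations(range(len(data)), 2)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for gray-scale images: 20 vectors of 64 values in [0, 1).
    pixels = [[rng.random() for _ in range(64)] for _ in range(20)]
    data = [binarize(p) for p in pixels]
    # Average over 10 independent repetitions, as in the reported setup.
    runs = [mae_over_pairs(data, K=128, rng=rng) for _ in range(10)]
    print(sum(runs) / len(runs))
```

Each repetition draws fresh permutations, so averaging the ten MAE values smooths out the randomness of any single draw.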