Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Distributed Clustering via LSH Based Data Partitioning

Authors: Aditya Bhaskara, Maheshakya Wijewardena

ICML 2018 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6. Empirical Study We evaluate our algorithmic ideas on two synthetic and two real datasets, of varying sizes. In the former case, we know the ground truth clustering, the right k , and the optimum objective value. We use it to demonstrate how the quality of clustering depends on the parameter ℓ the number of independent hashes we concatenate. In all the datasets, we compare the objective value obtained by our algorithm with the one obtained by k-means++ (part of scikit-learn (Pedregosa et al., 2011)). This will only be possible for small enough datasets, as k-means++ is a single machine algorithm that uses Ω(nk) memory.
Researcher Affiliation	Academia	1School of Computing, University of Utah. Correspondence to: Aditya Bhaskara <EMAIL>, Maheshakya Wijewardena <EMAIL>.
Pseudocode	Yes	Algorithm 1 Find-Cover Input: set of points U, rough estimate of optimum D. Output: a subset of points S. for i = 1 . . . κ do Hash every point in U to a bin (range [Lk]) using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2. Let Uj be the points hashed to bin j. Let Gj be the group of machines assigned for bin j. For each j, assign points in Uj uniformly at random to a machine in Gj. For each j, select a uniformly random subset of Uj of size O(1) from Gj and add them to S. (If the number of points in the group is O(1), add all of them to S.) end for
Open Source Code	No	The paper does not provide an explicit statement about the release of source code for the methodology described, nor does it include a link to a code repository.
Open Datasets	Yes	The ﬁrst dataset is SUSY (see (P. Baldi, 2014)), which contains 5M points with 18 features." and "FMA: A Dataset For Music Analysis This dataset (see (Defferrard et al., 2017)) contains 518 features extracted from audio ﬁles available in the free music archive (FMA). It has 106574 points.
Dataset Splits	No	The paper mentions using a 'sub-sample of 100000 points' for the SUSY dataset and then generating 'coresets' by sampling points from buckets. However, it does not specify explicit train/validation/test splits with percentages, counts, or references to predefined splits for model evaluation.
Hardware Specification	No	The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies	No	The paper mentions 'k-means++ (part of scikit-learn (Pedregosa et al., 2011))' as a comparison baseline, but does not specify the version of scikit-learn or any other software dependencies with version numbers for their own implementation or experimental setup.
Experiment Setup	Yes	The algorithm can now be described (see Algorithm 4.1). ... using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2." and "For the ﬁrst dataset, we produce PLSH tuples using w = 15, t = 2, and vary ℓ.