Distributed Clustering via LSH Based Data Partitioning

Authors: Aditya Bhaskara, Maheshakya Wijewardena

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 6. Empirical Study We evaluate our algorithmic ideas on two synthetic and two real datasets, of varying sizes. In the former case, we know the ground truth clustering, the right k , and the optimum objective value. We use it to demonstrate how the quality of clustering depends on the parameter ℓ the number of independent hashes we concatenate. In all the datasets, we compare the objective value obtained by our algorithm with the one obtained by k-means++ (part of scikit-learn (Pedregosa et al., 2011)). This will only be possible for small enough datasets, as k-means++ is a single machine algorithm that uses Ω(nk) memory.
Researcher Affiliation Academia 1School of Computing, University of Utah. Correspondence to: Aditya Bhaskara <bhaskaraaditya@gmail.com>, Maheshakya Wijewardena <pmaheshakya4@gmail.com>.
Pseudocode Yes Algorithm 1 Find-Cover Input: set of points U, rough estimate of optimum D. Output: a subset of points S. for i = 1 . . . κ do Hash every point in U to a bin (range [Lk]) using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2. Let Uj be the points hashed to bin j. Let Gj be the group of machines assigned for bin j. For each j, assign points in Uj uniformly at random to a machine in Gj. For each j, select a uniformly random subset of Uj of size O(1) from Gj and add them to S. (If the number of points in the group is O(1), add all of them to S.) end for
Open Source Code No The paper does not provide an explicit statement about the release of source code for the methodology described, nor does it include a link to a code repository.
Open Datasets Yes The first dataset is SUSY (see (P. Baldi, 2014)), which contains 5M points with 18 features." and "FMA: A Dataset For Music Analysis This dataset (see (Defferrard et al., 2017)) contains 518 features extracted from audio files available in the free music archive (FMA). It has 106574 points.
Dataset Splits No The paper mentions using a 'sub-sample of 100000 points' for the SUSY dataset and then generating 'coresets' by sampling points from buckets. However, it does not specify explicit train/validation/test splits with percentages, counts, or references to predefined splits for model evaluation.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies No The paper mentions 'k-means++ (part of scikit-learn (Pedregosa et al., 2011))' as a comparison baseline, but does not specify the version of scikit-learn or any other software dependencies with version numbers for their own implementation or experimental setup.
Experiment Setup Yes The algorithm can now be described (see Algorithm 4.1). ... using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2." and "For the first dataset, we produce PLSH tuples using w = 15, t = 2, and vary ℓ.