Distributed Clustering via LSH Based Data Partitioning
Authors: Aditya Bhaskara, Maheshakya Wijewardena
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6. Empirical Study We evaluate our algorithmic ideas on two synthetic and two real datasets, of varying sizes. In the former case, we know the ground truth clustering, the right k , and the optimum objective value. We use it to demonstrate how the quality of clustering depends on the parameter ℓ the number of independent hashes we concatenate. In all the datasets, we compare the objective value obtained by our algorithm with the one obtained by k-means++ (part of scikit-learn (Pedregosa et al., 2011)). This will only be possible for small enough datasets, as k-means++ is a single machine algorithm that uses Ω(nk) memory. |
| Researcher Affiliation | Academia | 1School of Computing, University of Utah. Correspondence to: Aditya Bhaskara <bhaskaraaditya@gmail.com>, Maheshakya Wijewardena <pmaheshakya4@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Find-Cover Input: set of points U, rough estimate of optimum D. Output: a subset of points S. for i = 1 . . . κ do Hash every point in U to a bin (range [Lk]) using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2. Let Uj be the points hashed to bin j. Let Gj be the group of machines assigned for bin j. For each j, assign points in Uj uniformly at random to a machine in Gj. For each j, select a uniformly random subset of Uj of size O(1) from Gj and add them to S. (If the number of points in the group is O(1), add all of them to S.) end for |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the methodology described, nor does it include a link to a code repository. |
| Open Datasets | Yes | The first dataset is SUSY (see (P. Baldi, 2014)), which contains 5M points with 18 features." and "FMA: A Dataset For Music Analysis This dataset (see (Defferrard et al., 2017)) contains 518 features extracted from audio files available in the free music archive (FMA). It has 106574 points. |
| Dataset Splits | No | The paper mentions using a 'sub-sample of 100000 points' for the SUSY dataset and then generating 'coresets' by sampling points from buckets. However, it does not specify explicit train/validation/test splits with percentages, counts, or references to predefined splits for model evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'k-means++ (part of scikit-learn (Pedregosa et al., 2011))' as a comparison baseline, but does not specify the version of scikit-learn or any other software dependencies with version numbers for their own implementation or experimental setup. |
| Experiment Setup | Yes | The algorithm can now be described (see Algorithm 4.1). ... using the two layer hash with params t, wi, ℓ, Lk, where wi := 8ri(log n)3/2." and "For the first dataset, we produce PLSH tuples using w = 15, t = 2, and vary ℓ. |