Parallel and Efficient Hierarchical k-Median Clustering
Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper pairs theory with experiments: "We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice." and "In this work we design the first distributed algorithm for the hierarchical Euclidean k-median problem with provable guarantees. Furthermore, we complement our theoretical analysis with an experimental study in a distributed setting where we show that our algorithm is significantly faster than parallel hierarchical [31] and flat solutions [3, 10, 18] for the k-median problem." The evaluation compares against both hierarchical and non-hierarchical algorithms, reporting the expected value and the variance over 5 runs for all randomized algorithms (see the protocol sketch after this table), on data sets from the UCI Machine Learning Repository [17]: KDD Cup... Figures and tables are presented (Fig. 1, Table 1, Fig. 2, Fig. 3). |
| Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research, cohenaddad@google.com); Silvio Lattanzi (Google Research, silviol@google.com); Ashkan Norouzi-Fard (Google Research, ashkannorouzi@google.com); Christian Sohler (University of Cologne, csohler@uni-koeln.de); Ola Svensson (EPFL, ola.svensson@epfl.ch) |
| Pseudocode | Yes | Algorithm 1: GREEDY algorithm for hierarchical k-median on a 2-RHST; Algorithm 2: Quasi-linear algorithm for hierarchical k-median on an RHST. (A hedged illustrative sketch of the greedy flavor follows the table.) |
| Open Source Code | No | The paper states that the algorithm is implementable in standard frameworks such as MapReduce, Hadoop, and Spark, but it provides no link to, or statement about releasing, source code for the proposed method. |
| Open Datasets | Yes | We evaluate the algorithms on the following data sets from the UCI Machine Learning Repository [17]: KDD Cup (a.k.a. Intrusion, n = 31,029; d = 35), Year Prediction MSD (a.k.a. SONG, n = 515,345; d = 90), US Census Data (1990) (n = 2,458,285; d = 29), and HIGGS (n = 11,000,000; d = 28). Reference: [17] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. (A loading sketch follows the table.) |
| Dataset Splits | No | The paper reports results for 5 runs (average value and variance) for randomized algorithms and mentions evaluating on datasets, but it does not provide specific details on training/validation/test splits, percentages, or counts used for experiments. |
| Hardware Specification | No | The paper discusses the MPC model with 'machines with memory s', but it does not provide concrete hardware details such as CPU/GPU models, memory sizes, or cloud instance types used for the experiments. |
| Software Dependencies | No | The paper names standard parallel frameworks (MapReduce, Hadoop, Spark, Dryad) in which the algorithm can be implemented, but it does not specify version numbers for these or any other software dependencies needed to reproduce the experiments. |
| Experiment Setup | No | The paper describes the distributed setting and comparisons with other algorithms, but it does not provide specific setup details such as parameter settings (e.g., the values of k, the number of machines, or the memory per machine) or other run configurations. |
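For context on the evaluation protocol referenced above: the paper reports the expected value and the variance over 5 runs for every randomized algorithm. A minimal sketch of that reporting loop, assuming a hypothetical `run_algorithm(data, k, seed)` callable that returns a k-median cost (the name and signature are illustrative, not from the paper):

```python
import statistics

def report_over_runs(run_algorithm, data, k, runs=5):
    """Mean and variance over repeated runs, mirroring the paper's
    'expected value and variance over 5 runs' protocol.
    `run_algorithm` is a hypothetical callable returning a k-median cost."""
    costs = [run_algorithm(data, k, seed=r) for r in range(runs)]
    return statistics.mean(costs), statistics.variance(costs)
```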
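The paper's Algorithm 1 (GREEDY for hierarchical k-median on a 2-RHST) is not reproduced here. As a loose illustration of the greedy, nested-centers flavor, the sketch below runs a standard incremental greedy k-median over a precomputed distance matrix; in the paper's setting that metric would be the tree metric induced by the 2-RHST. This is a generic stand-in, not the authors' algorithm.

```python
import numpy as np

def greedy_incremental_kmedian(dist, k):
    """Open centers one at a time on a precomputed metric.

    dist: (n, n) pairwise distances; in the paper's setting this would be
    the 2-RHST tree metric (an assumption made purely for illustration).
    Returns nested center sets C_1, C_2, ..., C_k; the nesting is what
    makes the incremental solution hierarchical.
    """
    centers = []
    nested = []
    best = None  # distance from each point to its closest open center
    for _ in range(k):
        if best is None:
            # First center: the 1-median, minimizing total connection cost.
            c = int(np.argmin(dist.sum(axis=1)))
            best = dist[c].copy()
        else:
            # Opening candidate i saves max(best - dist[i], 0) per point;
            # greedily pick the candidate with the largest total saving.
            gains = np.maximum(best[None, :] - dist, 0.0).sum(axis=1)
            gains[centers] = -1.0  # never re-open an existing center
            c = int(np.argmax(gains))
            best = np.minimum(best, dist[c])
        centers.append(c)
        nested.append(list(centers))
    return nested
```

Reading the nested sequence backwards (closing centers one at a time) gives the k-to-1 sequence of solutions that hierarchical k-median asks for.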
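All four datasets are publicly available from the UCI repository [17]. scikit-learn ships a fetcher for the KDD Cup data, which makes a convenient stand-in for the paper's "Intrusion" dataset; a minimal loading sketch, assuming scikit-learn >= 0.24 (for `as_frame`) and noting that the paper's exact variant and preprocessing are not specified:

```python
from sklearn.datasets import fetch_kddcup99

# KDD Cup ("Intrusion" in the paper); percent10=False downloads the full set.
raw = fetch_kddcup99(percent10=False, as_frame=True)

# Keep numeric attributes only, matching the paper's Euclidean setting;
# how the paper handles categorical columns is not stated.
X = raw.frame.select_dtypes(include="number").to_numpy(dtype=float)
print(X.shape)
```

The other three datasets (Year Prediction MSD, US Census Data (1990), HIGGS) can be downloaded directly from the UCI Machine Learning Repository and loaded the same way, e.g. with `pandas.read_csv`.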