Parallel and Efficient Hierarchical k-Median Clustering

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice. Our Results. In this work we design the first distributed algorithm for the hierarchical Euclidean k-median problem with provable guarantees. Furthermore, we complement our theoretical analysis with an experimental study in a distributed setting, where we show that our algorithm is significantly faster than parallel hierarchical [31] and flat solutions [3, 10, 18] for the k-median problem. In this section we empirically compare our algorithm with both hierarchical and non-hierarchical algorithms. We report the expected value and the variance over 5 runs for all the randomized algorithms. We evaluate the algorithms on the following data sets from the UCI Machine Learning Repository [17]: KDD Cup... Figures and tables are presented (Fig. 1, Table 1, Fig. 2, Fig. 3).
Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research, cohenaddad@google.com); Silvio Lattanzi (Google Research, silviol@google.com); Ashkan Norouzi-Fard (Google Research, ashkannorouzi@google.com); Christian Sohler (University of Cologne, csohler@uni-koeln.de); Ola Svensson (EPFL, ola.svensson@epfl.ch)
Pseudocode | Yes | Algorithm 1: GREEDY algorithm for hierarchical k-median on a 2-RHST; Algorithm 2: quasi-linear algorithm for hierarchical k-median on an RHST. (A hedged illustrative sketch of a related greedy heuristic is given after the table.)
Open Source Code | No | The paper states that the algorithm can be implemented in standard frameworks such as MapReduce, Hadoop, and Spark, but it does not provide a link to, or a statement about releasing, the source code for the proposed method.
Open Datasets | Yes | We evaluate the algorithms on the following data sets from the UCI Machine Learning Repository [17]: KDD Cup (a.k.a. Intrusion, n = 31,029; d = 35), Year Prediction MSD (a.k.a. SONG, n = 515,345; d = 90), US Census Data (1990) (n = 2,458,285; d = 29), and HIGGS (n = 11,000,000; d = 28). [17] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
Dataset Splits | No | The paper reports results for 5 runs (average value and variance) for randomized algorithms and mentions evaluating on datasets, but it does not provide specific details on training/validation/test splits, percentages, or counts used for the experiments.
Hardware Specification | No | The paper mentions running experiments on 'machines with memory s' and discusses the MPC model, but it does not provide specific hardware details such as CPU/GPU models, memory sizes, or cloud instance types used for the experiments.
Software Dependencies | No | The paper mentions standard parallel frameworks such as MapReduce, Hadoop, Spark, and Dryad in which the algorithm can be implemented, but it does not specify exact version numbers for these or any other software dependencies required to reproduce the experiments.
Experiment Setup | No | The paper discusses the distributed setting and comparisons with other algorithms, but it does not provide specific experimental setup details such as hyperparameters or training configurations.
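
As a companion to the Pseudocode row above, the following is a minimal Python sketch of a generic greedy incremental k-median heuristic: centers are opened one at a time, so the first k chosen centers form a valid solution for every k, which is the nesting property hierarchical k-median requires. This is an illustrative sketch under that assumption, not the paper's Algorithm 1, which operates on a 2-RHST tree embedding in the distributed (MPC) setting and comes with provable guarantees; the function name and interface are hypothetical.

    import numpy as np

    def greedy_incremental_kmedian(points, k_max):
        """Greedily open centers one at a time (illustrative sketch only).

        points : (n, d) NumPy array of Euclidean points.
        k_max  : largest number of centers to open.
        Returns a list of point indices; the first k entries form the
        (nested) center set for every k <= k_max.
        """
        n = len(points)
        best_dist = np.full(n, np.inf)  # distance to the closest opened center
        centers = []
        for _ in range(k_max):
            best_idx, best_cost = None, np.inf
            # Pick the point whose opening reduces the k-median cost
            # (sum of distances to the nearest center) the most.
            for i in range(n):
                if i in centers:
                    continue
                d_i = np.linalg.norm(points - points[i], axis=1)
                cost = float(np.minimum(best_dist, d_i).sum())
                if cost < best_cost:
                    best_idx, best_cost = i, cost
            centers.append(best_idx)
            best_dist = np.minimum(
                best_dist, np.linalg.norm(points - points[best_idx], axis=1))
        return centers

As a usage note, on a point set X the call greedy_incremental_kmedian(X, 10)[:k] returns the centers used at level k of the resulting nested sequence; the paper's algorithm instead builds its hierarchy on a tree embedding and runs in parallel, which this single-machine sketch does not capture.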