Parallel and Efficient Hierarchical k-Median Clustering

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice. Our Results. In this work we design the first distributed algorithm for the hierarchical Euclidean k-median problem with provable guarantees. Furthermore, we complement our theoretical analysis with an experimental study in a distributed setting, where we show that our algorithm is significantly faster than parallel hierarchical [31] and flat solutions [3, 10, 18] for the k-median problem. In this section we empirically compare our algorithm with both hierarchical and non-hierarchical algorithms. We report the expected value and the variance over 5 runs for all the randomized algorithms. We evaluate the algorithms on the following data sets from the UCI Machine Learning Repository [17]: KDD Cup... Figures and tables are presented (Fig. 1, Table 1, Fig. 2, Fig. 3).
Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research, cohenaddad@google.com); Silvio Lattanzi (Google Research, silviol@google.com); Ashkan Norouzi-Fard (Google Research, ashkannorouzi@google.com); Christian Sohler (University of Cologne, csohler@uni-koeln.de); Ola Svensson (EPFL, ola.svensson@epfl.ch)
Pseudocode | Yes | Algorithm 1: GREEDY algorithm for hierarchical k-median on a 2-RHST; Algorithm 2: quasi-linear algorithm for hierarchical k-median on an RHST. (A hedged illustrative sketch of a related greedy heuristic is given after the table.)
Open Source Code | No | The paper states that the algorithm can be implemented in standard frameworks such as MapReduce, Hadoop, and Spark, but it does not provide a link to, or a statement about releasing, the source code for the proposed method.
Open Datasets | Yes | We evaluate the algorithms on the following data sets from the UCI Machine Learning Repository [17]: KDD Cup (a.k.a. Intrusion, n = 31,029; d = 35), Year Prediction MSD (a.k.a. SONG, n = 515,345; d = 90), US Census Data (1990) (n = 2,458,285; d = 29), and HIGGS (n = 11,000,000; d = 28). [17] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
Dataset Splits | No | The paper reports results for 5 runs (average value and variance) for randomized algorithms and mentions evaluating on datasets, but it does not provide specific details on training/validation/test splits, percentages, or counts used for the experiments.
Hardware Specification | No | The paper mentions running experiments on 'machines with memory s' and discusses the MPC model, but it does not provide specific hardware details such as CPU/GPU models, memory sizes, or cloud instance types used for the experiments.
Software Dependencies | No | The paper mentions standard parallel frameworks such as MapReduce, Hadoop, Spark, and Dryad in which the algorithm can be implemented, but it does not specify exact version numbers for these or any other software dependencies required to reproduce the experiments.
Experiment Setup | No | The paper discusses the distributed setting and comparisons with other algorithms, but it does not provide specific experimental setup details such as hyperparameters or training configurations.
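
As a companion to the Pseudocode row above, the following is a minimal Python sketch of a generic greedy incremental k-median heuristic: centers are opened one at a time, so the first k chosen centers form a valid solution for every k, which is the nesting property hierarchical k-median requires. This is an illustrative sketch under that assumption, not the paper's Algorithm 1, which operates on a 2-RHST tree embedding in the distributed (MPC) setting and comes with provable guarantees; the function name and interface are hypothetical.

    import numpy as np

    def greedy_incremental_kmedian(points, k_max):
        """Greedily open centers one at a time (illustrative sketch only).

        points : (n, d) NumPy array of Euclidean points.
        k_max  : largest number of centers to open.
        Returns a list of point indices; the first k entries form the
        (nested) center set for every k <= k_max.
        """
        n = len(points)
        best_dist = np.full(n, np.inf)  # distance to the closest opened center
        centers = []
        for _ in range(k_max):
            best_idx, best_cost = None, np.inf
            # Pick the point whose opening reduces the k-median cost
            # (sum of distances to the nearest center) the most.
            for i in range(n):
                if i in centers:
                    continue
                d_i = np.linalg.norm(points - points[i], axis=1)
                cost = float(np.minimum(best_dist, d_i).sum())
                if cost < best_cost:
                    best_idx, best_cost = i, cost
            centers.append(best_idx)
            best_dist = np.minimum(
                best_dist, np.linalg.norm(points - points[best_idx], axis=1))
        return centers

As a usage note, on a point set X the call greedy_incremental_kmedian(X, 10)[:k] returns the centers used at level k of the resulting nested sequence; the paper's algorithm instead builds its hierarchy on a tree embedding and runs in parallel, which this single-machine sketch does not capture.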