Affinity Clustering: Hierarchical Clustering at Scale
Authors: Mohammadhossein Bateni, Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, Raimondas Kiveris, Silvio Lattanzi, Vahab Mirrokni
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experimentally that our algorithms are scalable for huge data sets, e.g., for graphs with trillions of edges. ... Last but not least, we present an experimental study where we analyze the scalability and effectiveness of our newly introduced algorithms and we observe that, in most cases, affinity clustering outperforms all state-of-the-art algorithms from both quality and scalability standpoints. |
| Researcher Affiliation | Collaboration | Mohammad Hossein Bateni, Google Research, bateni@google.com; Soheil Behnezhad, University of Maryland, soheil@cs.umd.edu; Mahsa Derakhshan, University of Maryland, mahsaa@cs.umd.edu; Mohammad Taghi Hajiaghayi, University of Maryland, hajiagha@cs.umd.edu; Raimondas Kiveris, Google Research, rkiveris@google.com; Silvio Lattanzi, Google Research, silviol@google.com; Vahab Mirrokni, Google Research, mirrokni@google.com |
| Pseudocode | Yes | Algorithm 1: MST of Dense Graphs ... (See Algorithm 2 in the appendix.) A serial sketch of the Borůvka-style merging step behind this pseudocode appears below the table. |
| Open Source Code | Yes | Implementations are available at https://github.com/MahsaDerakhshan/AffinityClustering. |
| Open Datasets | Yes | We run our experiments on several data sets from the UCI database [37] and use Euclidean distance. ... We consider Iris, Wine, Soybean, Digits and Glass data sets. ... [37] Moshe Lichman. UCI Machine Learning Repository, 2013. |
| Dataset Splits | No | The paper mentions using datasets from the UCI database and evaluating performance with the Rand index against a 'ground truth clustering T' (a sketch of this metric appears below the table). However, it does not explicitly provide training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | The paper states 'While we cannot reveal the exact running times and number of machines used in the experiments, we report these quantities in normalized form.' It mentions 'Map Reduce workers' and 'machines for the DHT' but provides no specific hardware details such as GPU or CPU models, memory, or specific cloud instances used for the experiments. |
| Software Dependencies | No | The paper mentions using distributed computing platforms like 'Spark [45] and Hadoop [43] as well as Map Reduce and its extension Flume [17]' and 'Distributed Hash Tables (DHTs) [12, 31]', but it does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | No | The paper discusses the algorithms and their evaluation but does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or specific training configurations for the algorithms tested. |
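
For readers who want a concrete picture of the pseudocode referenced in the Pseudocode row: affinity clustering builds its hierarchy by repeating Borůvka-style MST rounds, in which every cluster follows its minimum-weight outgoing edge and the resulting connected components are contracted into new clusters. Below is a minimal single-machine sketch of that round logic. The paper's contribution is a distributed MapReduce/DHT implementation, so this serial version only illustrates the merging rule; the function and variable names (`affinity_round`, `affinity_clustering`, `cluster_of`) are illustrative assumptions, not taken from the authors' code.

```python
# Serial sketch of the Boruvka-style round behind affinity clustering.
# Assumes a weighted undirected graph given as a list of (u, v, w) edges
# over nodes 0..num_nodes-1. Names are illustrative, not the authors'.

def affinity_round(num_nodes, edges, cluster_of):
    """One round: every cluster follows its minimum-weight outgoing edge,
    then the chosen edges' components are contracted into new clusters."""
    # Find the cheapest edge leaving each cluster.
    best = {}  # cluster id -> (weight, cluster_u, cluster_v)
    for u, v, w in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # intra-cluster edge: not an outgoing edge
        for c in (cu, cv):
            if c not in best or w < best[c][0]:
                best[c] = (w, cu, cv)
    if not best:
        return cluster_of, False  # one cluster left: hierarchy complete

    # Contract chosen edges with a small union-find over cluster ids.
    parent = {c: c for c in set(cluster_of)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for _, cu, cv in best.values():
        parent[find(cu)] = find(cv)
    return [find(cluster_of[u]) for u in range(num_nodes)], True

def affinity_clustering(num_nodes, edges):
    """Returns one clustering per round, finest first, coarsest last."""
    cluster_of = list(range(num_nodes))  # start from singleton clusters
    levels = [cluster_of]
    changed = True
    while changed:
        cluster_of, changed = affinity_round(num_nodes, edges, cluster_of)
        if changed:
            levels.append(cluster_of)
    return levels

# Example: two tight pairs {0,1} and {2,3}, weakly linked to each other.
# Round 1 merges each pair; round 2 merges the two pairs.
edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 5.0)]
for level in affinity_clustering(4, edges):
    print(level)
```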
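The Dataset Splits row quotes the paper's use of the Rand index against a ground-truth clustering T. As a reference for what that metric measures, here is the standard textbook definition and a sketch; it is not the authors' evaluation code. For n points, RI = (a + b) / C(n, 2), where a counts point pairs placed in the same cluster by both clusterings and b counts pairs separated by both, so RI is the fraction of pairs on which the two clusterings agree.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Standard Rand index: fraction of point pairs on which two
    clusterings agree (same-same or different-different).
    Assumes at least two points."""
    assert len(labels_a) == len(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Perfect agreement up to label renaming scores 1.0.
print(rand_index([0, 0, 1, 1], [5, 5, 9, 9]))  # -> 1.0
```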