Hierarchical Clustering Beyond the Worst-Case

Authors: Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the effectiveness of LINKAGE++ on real-world and synthetic datasets. We compare our results to classic agglomerative heuristics for hierarchical clustering, both in terms of the cost function and the classification error. Our goal is to answer the question: how good is LINKAGE++ compared to the classic agglomerative approaches on real-world and synthetic data that exhibit a ground-truth clustering?
Researcher Affiliation | Academia | Vincent Cohen-Addad, University of Copenhagen (vcohenad@gmail.com); Varun Kanade, University of Oxford and Alan Turing Institute (varunk@cs.ox.ac.uk); Frederik Mallmann-Trenn, MIT (mallmann@mit.edu)
Pseudocode | Yes | Algorithm 1: LINKAGE++
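The paper's Algorithm 1 is not reproduced in this report. Based only on the two-phase description in the experiment setup row below (a PCA-based phase 1, run until a guessed number of clusters k remains, followed by a "density-based" phase 2 that merges the resulting clusters), a hedged sketch might look as follows. The projection dimension, the single-linkage rule in phase 1, and the average-distance merge criterion in phase 2 are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA


def linkage_pp(X, k, n_components=None):
    """Two-phase agglomerative clustering in the spirit of LINKAGE++.

    Phase 1 (assumed): single linkage on PCA-projected points until k
    clusters remain. Phase 2 (assumed): keep merging those k clusters,
    always joining the pair with the smallest mean pairwise distance,
    until one cluster remains. Returns the flat k-clustering and the
    phase-2 merge order.
    """
    # Phase 1: project to a low-dimensional subspace, then single linkage.
    Z_proj = PCA(n_components or k).fit_transform(X)
    flat = fcluster(linkage(Z_proj, method="single"), t=k, criterion="maxclust")

    # Phase 2: agglomerate the k clusters by mean inter-cluster distance.
    clusters = {c: np.where(flat == c)[0] for c in np.unique(flat)}
    merges = []
    while len(clusters) > 1:
        best, best_dist = None, np.inf
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                a, b = clusters[keys[i]], clusters[keys[j]]
                dist = np.mean(
                    np.linalg.norm(X[a][:, None] - X[b][None, :], axis=-1)
                )
                if dist < best_dist:
                    best_dist, best = dist, (keys[i], keys[j])
        a, b = best
        merges.append((a, b))
        clusters[a] = np.concatenate([clusters[a], clusters.pop(b)])
    return flat, merges
```

The experiment setup below runs this kind of procedure for several guesses of k and keeps the output with the smallest cost; that outer loop is omitted here.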
Open Source Code | No | No concrete access to source code (a specific repository link, an explicit code-release statement, or code in the supplementary materials) for the methodology described in this paper was found.
Open Datasets | Yes | The datasets we use are part of the standard Scikit-learn library [4] (and most of them are available at the UCI machine learning repository [1]). Most of these datasets exhibit a flat clustering structure, with the exception of the newsgroup dataset, which is truly hierarchical. The goal of the algorithm is to perform a clustering of the data by finding the underlying classes. The datasets are: iris, digits, newsgroup2, diabetes, cancer, boston.
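The flat-clustering datasets named above ship with scikit-learn and can be loaded directly; a minimal loading sketch follows. `boston` was removed from scikit-learn in version 1.2 and the `newsgroup2` subset would require a download via `fetch_20newsgroups`, so both are omitted to keep this self-contained.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_iris,
)

# Bundled datasets used in the paper's evaluation. `boston` (removed in
# scikit-learn 1.2) and the newsgroup subset (needs a download) are skipped.
loaders = {
    "iris": load_iris,
    "digits": load_digits,
    "diabetes": load_diabetes,
    "cancer": load_breast_cancer,
}

for name, load in loaders.items():
    bunch = load()
    X, y = bunch.data, bunch.target
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features")
```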
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning was found. The paper mentions evaluating on datasets but does not describe training, validation, or test splits.
Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments were found.
Software Dependencies | No | The datasets we use are part of the standard Scikit-learn library [4]. No specific version numbers for software dependencies were provided.
Experiment Setup | Yes | We run LINKAGE++ with 9 different breakpoints at which we switch between phase 1 and phase 2 (corresponding to guesses of k), and output the clustering with the smallest cost. To evaluate our algorithm, we compare its performance to classic agglomerative heuristics (for the similarity setting): single linkage and complete linkage (see also [24, 13] for a complete description), and to the approach of performing only phase 1 of LINKAGE++ until only one cluster remains; we denote this approach PCA+. Additionally, we compare against applying only phase 2 of LINKAGE++; we call this approach density-based linkage. We generate random graphs of sizes n ∈ {256, 512, 1024} according to the model described in Section 2.1. More precisely, we define a binary tree on ℓ ∈ {4, 8} bottom clusters/leaves. Each leaf represents a class, and we create n/ℓ vertices for each class. The probability of an edge between two vertices of classes a and b is the probability induced by the lowest common ancestor of the leaves corresponding to a and b. We first define p_min = 2 log(n)ℓ/n. The probabilities induced by the vertices of the binary tree are the following: the probability at the root is p = p_min + (1 − p_min)/log(ℓ), and the probability induced by a node at distance d from the root is (d + 1)p. In particular, the probability induced by the leaves is p_min + log(ℓ)(1 − p_min)/log(ℓ) = 1.
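The synthetic graph model described above can be sampled with a short script. The probability schedule in the extracted text is ambiguous, so this sketch assumes a linear interpolation from p_min at the root to 1 at the leaves, which is consistent with the stated leaf probability of 1; the function name `sample_hsbm` and this exact schedule are illustrative assumptions, not the paper's verbatim procedure.

```python
import numpy as np


def lca_depth(a, b, depth):
    """Distance from the root of the lowest common ancestor of leaves a, b
    in a complete binary tree with 2**depth leaves (indexed 0..2**depth-1)."""
    return depth if a == b else depth - (a ^ b).bit_length()


def sample_hsbm(n, num_leaves, seed=0):
    """Sample a graph from the hierarchical block model sketched above.

    Assumed schedule: a tree node at distance d from the root induces edge
    probability p_min + d * (1 - p_min) / log2(num_leaves), so vertices of
    the same class (LCA = their leaf) are always connected, matching the
    stated leaf probability of 1. Requires n divisible by num_leaves.
    """
    assert n % num_leaves == 0
    rng = np.random.default_rng(seed)
    depth = int(np.log2(num_leaves))
    p_min = 2 * np.log(n) * num_leaves / n
    labels = np.repeat(np.arange(num_leaves), n // num_leaves)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            d = lca_depth(int(labels[i]), int(labels[j]), depth)
            p = p_min + d * (1.0 - p_min) / depth
            if rng.random() < p:
                adj[i, j] = adj[j, i] = 1
    return adj, labels


# Example at the smallest size used in the paper: n = 256, ℓ = 4.
adj, labels = sample_hsbm(256, 4)
```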