Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Efficient and Effective Generic Agglomerative Hierarchical Clustering Approach

Author: Julien Ah-Pine

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Artificial and real-world benchmarks are used to exemplify these points. From a theoretical standpoint, SNK-AHC provides another interpretation of the classic techniques which relies on the concept of weighted penalized similarities. ... Section 6 is dedicated to the experiments which are carried out on both artificial and real-world data sets."
Researcher Affiliation | Academia | "Julien Ah-Pine, EMAIL, University of Lyon, Lyon 2, ERIC EA3083, 5 avenue Pierre Mendès France, 69676 Bron Cedex, France"
Pseudocode | Yes | "Algorithm 1: General procedure of D-AHC. Algorithm 2: General procedure of K-AHC. Algorithm 3: General procedure of the K-AHC based stored data matrix approach. Algorithm 4: General procedure of SNK-AHC. Algorithm 5: Connected components determination."
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it link to a code repository.
Open Datasets | Yes | "We use both artificial and real-world problems which are freely available at (Fränti et al., 2015) and (Lichman, 2013) respectively. ... The first collection is called the landsat data set https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) ... The second collection we used is called the pendigits data set https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits"
Dataset Splits | No | "For each obtained dendrogram, we cut the forest so as to obtain the correct number of clusters, denoted κ. Note that if the number of clusters found by Algorithm 4 is greater than κ, then we keep the partition with κ clusters. Afterward, we compare the resulting partition and the ground truth. The evaluation measure used in this case is the famous adjusted Rand index (Hubert and Arabie, 1985), which is denoted ARI." The paper describes clustering evaluation against a known ground truth but does not specify the train/validation/test splits typically used for supervised learning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, or memory) used to run the experiments.
Software Dependencies | No | The paper does not name specific software with version numbers for its implementation. It only mentions "popular SVM tools like (Chang and Lin, 2011)" in the context of a default setting for the Gaussian kernel, not as software the authors used with a version.
Experiment Setup | Yes | "Regarding the Gaussian kernel, we remind its definition below: S_ab = exp(−γ‖x_a − x_b‖²), ∀a, b ∈ O. We set γ = 1/q, q being the number of descriptive variables. ... Concerning NNk, the distinct k values were successively set to (the nearest integer of) {100, 90, 75, 50, 25, 10, 1} percent of n, the total number of items. ... the sparsification method we used here is based on a threshold following (26). The different θ values were chosen so that a certain level of sparsity is reached. Precisely, they correspond to the {100, 90, 75, 50, 25, 10, 1}th percentiles of the similarity values distribution."
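The Gaussian kernel and threshold sparsification quoted above can be sketched as follows. The function names are ours, and the convention that sparsification keeps entries at or above the θ threshold and zeroes the rest is an assumption, not a detail the paper states:

```python
import numpy as np

def gaussian_similarity(X, gamma=None):
    # S_ab = exp(-gamma * ||x_a - x_b||^2), with the paper's default gamma = 1/q
    n, q = X.shape
    if gamma is None:
        gamma = 1.0 / q
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # guard against tiny negative round-off
    return np.exp(-gamma * d2)

def sparsify(S, p):
    # theta = p-th percentile of the similarity values; entries below theta -> 0
    theta = np.percentile(S, p)
    return np.where(S >= theta, S, 0.0)
```

With p swept over {100, 90, 75, 50, 25, 10, 1} as in the quote, higher percentiles give sparser matrices (p = 100 retains only the maximal entries).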
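The evaluation measure cited in the Dataset Splits row, the adjusted Rand index, is permutation-invariant: identical partitions score 1.0 regardless of how the clusters are named. The paper does not say which implementation it uses; scikit-learn's `adjusted_rand_score` is an illustrative choice:

```python
from sklearn.metrics import adjusted_rand_score

# Two labelings of six items: the same partition under a label swap.
truth = [0, 0, 0, 1, 1, 1]
cut   = [1, 1, 1, 0, 0, 0]  # dendrogram cut with cluster names exchanged

print(adjusted_rand_score(truth, cut))  # identical partitions give ARI = 1.0

# A worse-than-chance partition can score below 0.
print(adjusted_rand_score(truth, [0, 1, 0, 1, 0, 1]))
```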