MC-GTA: Metric-Constrained Model-Based Clustering using Goodness-of-fit Tests with Autocorrelations

Authors: Zhangyu Wang, Gengchen Mai, Krzysztof Janowicz, Ni Lao

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We perform extensive experiments on two synthetic and seven real-world datasets which cover both temporal and spatial clustering tasks. We compare MC-GTA with a wide range of baselines. The detailed experiment setup, baseline algorithms and evaluation metrics can be found in Appendix A.5.
Researcher Affiliation Collaboration Zhangyu Wang (1), Gengchen Mai (2, 3), Krzysztof Janowicz (4, 1), Ni Lao (5). (1) Department of Geography, University of California Santa Barbara, CA, USA; (2) Department of Geography, University of Georgia, GA, USA; (3) SEAI Lab, Department of Geography and the Environment, University of Texas at Austin, TX, USA; (4) Faculty of Geosciences, Geography and Astronomy, University of Vienna, Vienna, Austria; (5) Google, Mountain View, CA, USA.
Pseudocode Yes Algorithm 1: MC-GTA Algorithm
    Input: A dataset D of N observations {X_i = (f_i ∈ F, p_i ∈ M)}, i = 1, ..., N. The distance function d_m. The metric penalty function r. The model fitting algorithm GL. The density-based clustering algorithm DB. The number of neighbors n used for model fitting. The metric-constraint strength β. The margin hyperparameter δ.
    Output: A clustering C = {C_k}, k = 1, ..., K.
    1: for each observation X_i ∈ D
    2:     find n nearest observations N_i in the metric space
    3:     fit the model parameters θ_i ← GL(N_i) (Sec 5.1)
    4: for each pair of observations, compute their
    5:     model dissimilarity d_m(i, j) ← W_2^2(θ_i, θ_j) (Eq 5)
    6:     metric distance d_c(i, j) ← d_c(p_i, p_j)
    7: compute empirical generalized semivariogram γ̂_m (Eq 6)
    8: fit theoretical generalized semivariogram γ_m from γ̂_m
    9: compute range ρ from γ_m (Sec 4.1)
    10: compute loss matrix M^w_{i,j} ← d_m(i, j) + β r_ρ(i, j) (Eq 10)
    11: run density-based clustering algorithm C ← DB(M^w)
    12: return C
Open Source Code Yes The implementation of our algorithm and a tutorial is publicized on GitHub via https://github.com/Octopolugal/MC-GTA.git.
Open Datasets Yes The detailed information can be found at https://timeseriesclassification.com/description.php?Dataset=AsphaltPavementType. (Pavement Dataset)
Dataset Splits No No explicit training/validation/test dataset splits with percentages or absolute sample counts are provided. The paper mentions 'We conduct hyperparameter tuning on the number of neighbors n, the weight β, and the margin δ' which implies a validation process but doesn't specify how data was split for this.
Hardware Specification No No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments are provided in the paper.
Software Dependencies No The paper mentions 'sklearn (Pedregosa et al., 2011)' and 'scikit-learn: Machine learning in Python' (in references), indicating that Python and scikit-learn are used. It also mentions 'Mclust version 3: An R package' (in references). However, the paper does not state version numbers for any of these software dependencies.
Experiment Setup Yes We conduct hyperparameter tuning on the number of neighbors n, the weight β, and the margin δ. The only hyperparameters we preset are cluster number K, feature dimension D, noise scale α and sample batch size k.
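
The flow of Algorithm 1 quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in the linked GitHub repository) and rests on several assumptions: the model fitting algorithm GL is replaced by a ridge-regularized Gaussian fit, the metric space is one-dimensional, DB is scikit-learn's DBSCAN on a precomputed matrix, the range ρ is passed in directly instead of being estimated from a semivariogram (steps 7 to 9 are skipped), and the penalty is a hypothetical placeholder rather than the paper's Eq 10. All function names, the `ridge`, `eps` and `min_samples` parameters, and the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def psd_sqrt(a):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def gaussian_w2_squared(mean_a, cov_a, mean_b, cov_b):
    """Closed-form squared 2-Wasserstein distance between two Gaussians."""
    root_b = psd_sqrt(cov_b)
    cross = psd_sqrt(root_b @ cov_a @ root_b)
    value = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return max(float(value), 0.0)  # clip tiny negative round-off


def fit_local_gaussians(features, positions, n_neighbors, ridge=1e-3):
    """Steps 1-3, with a ridge-regularized Gaussian as a stand-in for GL."""
    # Assumption: 1-D metric space, so d_c is the absolute difference.
    metric_dist = np.abs(positions[:, None] - positions[None, :])
    models = []
    for i in range(len(features)):
        nearest = np.argsort(metric_dist[i])[:n_neighbors]
        block = features[nearest]
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        models.append((block.mean(axis=0), cov + ridge * np.eye(cov.shape[0])))
    return models, metric_dist


def mc_gta_sketch(features, positions, n_neighbors, beta, rho, penalty,
                  eps, min_samples=5):
    """Steps 1-6, 10 and 11; rho is supplied by the caller (steps 7-9 skipped)."""
    models, metric_dist = fit_local_gaussians(features, positions, n_neighbors)
    n = len(models)
    model_dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = gaussian_w2_squared(*models[i], *models[j])
            model_dist[i, j] = model_dist[j, i] = d
    # Step 10: model dissimilarity plus the weighted metric penalty.
    loss = model_dist + beta * penalty(metric_dist, rho)
    # Step 11: density-based clustering on the precomputed loss matrix.
    clustering = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return clustering.fit_predict(loss)


def placeholder_penalty(metric_dist, rho):
    """HYPOTHETICAL penalty (not Eq 10): zero within range rho, linear beyond."""
    return np.clip(metric_dist - rho, 0.0, None)


# Toy temporal data: two consecutive segments drawn from different Gaussians.
rng = np.random.default_rng(0)
positions = np.arange(60, dtype=float)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 2)),
    rng.normal(20.0, 1.0, size=(30, 2)),
])
labels = mc_gta_sketch(features, positions, n_neighbors=8, beta=0.1, rho=10.0,
                       penalty=placeholder_penalty, eps=3.0, min_samples=3)
```

On this toy input the first and last observations fall into different clusters, because the fitted local models on either side of the change point are far apart in Wasserstein distance. Observations whose neighborhoods straddle the change point may form a separate cluster or be marked as noise, which is one reason the neighborhood size n is listed among the tuned hyperparameters.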