Dependency Clustering of Mixed Data with Gaussian Mixture Copulas

Authors: Vaibhav Rajan, Sakyajit Bhattacharya

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically demonstrate performance improvements over state-of-the-art methods of correlation clustering on synthetic and benchmark datasets." "Our experimental results demonstrate the efficacy of our method, that outperforms state-of-the-art methods for correlation clustering on synthetic and real benchmark data sets with mixed features, thus illustrating the advantage of our copula-based approach for dependency clustering."
Researcher Affiliation | Industry | Vaibhav Rajan, Sakyajit Bhattacharya, Xerox Research Centre India ({vaibhav.rajan, sakyajit.bhattacharya}@xerox.com)
Pseudocode | Yes | The paper provides Algorithm 1 (EGMCM):

    Algorithm 1 EGMCM
    Input: R(Y), the scaled rank-transformed data Y; G, the number of clusters
    Initialization: Z = Φ^{-1}(R(Y))
    loop
        Estimate, via EM, the GMM parameters θ = [π_g, µ_g, Σ_g]
        Resample Z:
        for j = 1 to p do
            for all y ∈ unique{y_1j, ..., y_nj} do
                Compute z_lj = max{z_ij : y_ij < y} and z_uj = min{z_ij : y < y_ij}
                For each i such that y_ij = y:
                    Sample r_gij from TN(µ_gj, σ_gij, z_lj, z_uj)
                    Set z_ij = Σ_{g=1}^{G} π_g r_gij
            end for
        end for
    end loop
    Output: Cluster labels (latent variables of GMM(Z | θ))
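The pseudocode maps naturally onto standard tooling. Below is a minimal, non-authoritative sketch of the EGMCM loop, assuming scikit-learn's GaussianMixture for the EM step, scipy's truncnorm for the truncated-normal resampling, a diagonal covariance structure, numerically coded features, and a fixed iteration count; the function and variable names (scaled_rank, egmcm, n_iter) are illustrative choices, not details from the paper.

```python
# Illustrative sketch of Algorithm 1 (EGMCM); not the authors' implementation.
import numpy as np
from scipy.stats import norm, truncnorm
from sklearn.mixture import GaussianMixture


def scaled_rank(Y):
    """Scaled rank transform R(Y): column-wise ranks mapped into (0, 1)."""
    n = Y.shape[0]
    ranks = np.argsort(np.argsort(Y, axis=0), axis=0) + 1
    return ranks / (n + 1.0)


def egmcm(Y, G, n_iter=20, seed=0):
    """Cluster a numerically coded n x p data matrix Y into G groups."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    # Initialization: Z = Phi^{-1}(R(Y))
    Z = norm.ppf(scaled_rank(Y))

    for _ in range(n_iter):
        # Estimate, via EM, the GMM parameters theta = [pi_g, mu_g, Sigma_g].
        gmm = GaussianMixture(n_components=G, covariance_type="diag").fit(Z)
        pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_

        # Resample Z: tied/discrete observations get latent values drawn from
        # truncated normals bounded by their rank neighbours z_lj and z_uj.
        for j in range(p):
            col = Y[:, j]
            for y in np.unique(col):
                below = Z[col < y, j]
                above = Z[col > y, j]
                z_l = below.max() if below.size else -np.inf
                z_u = above.min() if above.size else np.inf
                for i in np.where(col == y)[0]:
                    z_new = 0.0
                    for g in range(G):
                        m, s = mu[g, j], np.sqrt(var[g, j])
                        a, b = (z_l - m) / s, (z_u - m) / s
                        # r_gij ~ TN(mu_gj, sigma_gj, z_l, z_u)
                        r = truncnorm.rvs(a, b, loc=m, scale=s, random_state=rng)
                        z_new += pi[g] * r
                    Z[i, j] = z_new

    # Output: cluster labels = most likely GMM component for each latent row.
    return gmm.predict(Z)
```

A call such as labels = egmcm(Y, G=3) would return one label per row of Y. Note that the sketch treats every column as ordered, which suits numerical and ordinal attributes; how nominal categories should be coded is not something this reconstruction settles.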
Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology.
Open Datasets | Yes | "We compare the performance of our algorithm on 10 benchmark datasets obtained from the UCI repository [Bache and Lichman, 2013]." Table 2 caption: "Details of datasets from UCI repository used in our experiments. n: number of observations, p_num: number of numerical features, p_cat: number of discrete features, G: number of clusters. Asterisk: dataset contains missing values." Reference: [Bache and Lichman, 2013] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
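The paper cites the UCI repository but does not say how the data were downloaded or preprocessed. A minimal sketch of fetching one such benchmark today, assuming the ucimlrepo convenience package (not mentioned in the paper) and using Iris (UCI id 53) purely as a stand-in:

```python
# Hypothetical retrieval of a UCI benchmark; the paper only cites
# [Bache and Lichman, 2013] and specifies no download tooling.
from ucimlrepo import fetch_ucirepo

dataset = fetch_ucirepo(id=53)      # any of the paper's 10 datasets could be
X = dataset.data.features           # fetched the same way by id or name
y = dataset.data.targets            # ground-truth class labels, if provided
print(X.shape, y.iloc[:, 0].nunique())  # (n, p) and the number of classes G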
Dataset Splits | No | The paper describes the datasets used (synthetic and UCI benchmark datasets) but does not provide specific details on train, validation, or test splits (e.g., percentages, sample counts, or predefined split references) required for reproduction.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU/CPU models, processor types, or memory amounts, used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | No | The paper describes simulation settings and variations in parameters such as G (number of clusters) and n (number of observations), but it does not provide specific experimental setup details such as hyperparameter values, optimizer settings, or other system-level training configurations for its algorithms.