New results on information theoretic clustering

Authors: Ferdinando Cicalese, Eduardo Laber, Lucas Murtinho

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also report experiments with a new clustering method that was designed on top of the theoretical tools leading to the above results. These experiments suggest a practical applicability for our method, in particular when the number of clusters is large." "Experiments. We compared RATIO-GREEDY with DIVISIVE CLUSTERING (DC for short), an adaptation of the k-means method proposed in (Dhillon et al., 2003), to solve PMWIP-Ent."
Researcher Affiliation | Academia | "¹Department of Computer Science, University of Verona, Verona, Italy. ²Departamento de Informática, PUC-Rio, Rio de Janeiro, Brazil."
Pseudocode | Yes | "The RATIO-GREEDY algorithm. If k ≤ g, RATIO-GREEDY runs ADom, the dominance algorithm presented in Section 4. Thus, we focus on the case k > g. Let V_i be the set of i-dominant vectors and let L_i be the list obtained by sorting the vectors in V_i according to their ratios as defined in Section 5. For the following explanation it will be convenient to think of L_i as a list of adjacent clusters that initially contains |V_i| unitary clusters. RATIO-GREEDY predefines the number of clusters t_i that will be available for the i-dominant vectors so that ∑_{i=1}^{g} t_i = k. It heuristically sets t_i = k · IEnt(V_i) / ∑_{j=1}^{g} IEnt(V_j). Then, for each i, RATIO-GREEDY greedily reduces the number of clusters in L_i from |V_i| to t_i by iteratively selecting two adjacent clusters in the current list and replacing them with their union, so that a list with one less cluster is obtained. The pair of adjacent clusters selected to be merged at each iteration is the one for which loss(·, ·) is minimum, where the loss of two clusters C and C′ is given by loss(C, C′) = IEnt(C ∪ C′) − IEnt(C) − IEnt(C′)."
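The merging phase quoted above is simple enough to sketch. The Python below is a minimal illustration, not the authors' implementation: it assumes IEnt is the usual weighted entropy impurity of a cluster's aggregate vector (‖v‖₁ · H(v/‖v‖₁)), uses a naive O(n²) scan for the best adjacent pair, and the helper names (`ient`, `allocate_budgets`, `greedy_merge`) and the rounding of the t_i budgets are our own choices.

```python
import numpy as np

def ient(v):
    """Entropy impurity of an aggregate count vector v, assumed here to be
    ||v||_1 * H(v / ||v||_1); check the paper for the exact definition."""
    total = v.sum()
    if total <= 0:
        return 0.0
    p = v[v > 0] / total
    return -total * float(np.sum(p * np.log2(p)))

def allocate_budgets(groups, k):
    """Heuristic budget t_i = k * IEnt(V_i) / sum_j IEnt(V_j); the paper does
    not spell out the rounding, so leftovers go to the largest fractions."""
    imps = np.array([ient(np.sum(np.asarray(Vi, dtype=float), axis=0)) for Vi in groups])
    raw = k * imps / imps.sum()
    t = np.floor(raw).astype(int)
    for j in np.argsort(raw - t)[::-1][: k - t.sum()]:
        t[j] += 1
    return t

def greedy_merge(sorted_vectors, t):
    """Reduce a ratio-sorted list of unitary clusters to t clusters by
    repeatedly merging the adjacent pair with minimum
    loss(C, C') = IEnt(C u C') - IEnt(C) - IEnt(C')."""
    clusters = [np.asarray(v, dtype=float) for v in sorted_vectors]
    while len(clusters) > t:
        losses = [ient(clusters[j] + clusters[j + 1]) - ient(clusters[j]) - ient(clusters[j + 1])
                  for j in range(len(clusters) - 1)]
        j = int(np.argmin(losses))                   # cheapest adjacent merge
        clusters[j] = clusters[j] + clusters[j + 1]  # union = sum of aggregate vectors
        del clusters[j + 1]
    return clusters
```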
Open Source Code | Yes | "Code and datasets are available in (Murtinho, 2019). URL https://github.com/lmurtinho/RatioGreedyClustering/tree/ICML_submission."
Open Datasets | Yes | "We tested these methods on clustering 51,480 words from the 20NEWS corpus and 170,946 words from the RCV1 corpus, according to their distributions w.r.t. 20 and 103 different classes, respectively. The distribution vectors associated with the words are built according to the methodology employed in (Dhillon et al., 2003) to address text classification tasks." "Code and datasets are available in (Murtinho, 2019). URL https://github.com/lmurtinho/RatioGreedyClustering/tree/ICML_submission."
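The distribution vectors themselves are easy to reconstruct in outline. A minimal sketch, assuming (as in Dhillon et al., 2003) that each word's vector collects its occurrence counts per class, so that normalizing a row estimates p(class | word); the function name and input shapes are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.sparse import csr_matrix

def word_class_vectors(doc_term: csr_matrix, labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate per-document term counts into one vector per word, with one
    component per class; row i normalized estimates p(class | word i)."""
    out = np.zeros((doc_term.shape[1], n_classes))
    for c in range(n_classes):
        # column-wise sum over the documents labeled with class c
        out[:, c] = np.asarray(doc_term[labels == c].sum(axis=0)).ravel()
    return out
```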
Dataset Splits | No | The paper uses the 20NEWS and RCV1 corpora for evaluation but does not specify explicit dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | "When k > g, the initialization of DC resembles ADom, since it consists of splitting the vectors in V_i among k/g clusters. When k ≤ g, we initialize DC using ADom."
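That initialization is straightforward to mirror in code. A sketch under the same assumptions as above, with `adom` as a placeholder for the paper's dominance algorithm (which is not reproduced here):

```python
import numpy as np

def init_dc(dominance_groups, k, adom):
    """DC initialization as described: for k > g, split each dominance group
    V_i evenly among roughly k/g clusters; for k <= g, fall back to ADom."""
    g = len(dominance_groups)
    if k <= g:
        return adom(dominance_groups, k)
    per_group = k // g  # k > g, so at least one cluster per group
    clusters = []
    for Vi in dominance_groups:
        # split the indices of V_i into near-equal contiguous chunks
        clusters.extend(np.array_split(np.arange(len(Vi)), per_group))
    return clusters
```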