New results on information theoretic clustering

Authors: Ferdinando Cicalese, Eduardo Laber, Lucas Murtinho

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also report experiments with a new clustering method that was designed on top of the theoretical tools leading to the above results. These experiments suggest a practical applicability for our method, in particular when the number of clusters is large." "Experiments. We compared RATIO-GREEDY with DIVISIVE CLUSTERING (DC for short), an adaptation of the k-means method proposed in (Dhillon et al., 2003), to solve PMWIP-Ent."
Researcher Affiliation | Academia | "¹Department of Computer Science, University of Verona, Verona, Italy. ²Departamento de Informática, PUC-Rio, Rio de Janeiro, Brazil."
Pseudocode | Yes | "The RATIO-GREEDY algorithm. If k ≤ g, RATIO-GREEDY runs ADom, the dominance algorithm presented in Section 4. Thus, we focus on the case k > g. Let V_i be the set of i-dominant vectors and let L_i be the list obtained by sorting the vectors in V_i according to their ratios as defined in Section 5. For the following explanation it will be convenient to think of L_i as a list of adjacent clusters that initially contains |V_i| unitary clusters. RATIO-GREEDY predefines the number of clusters t_i that will be available for the i-dominant vectors so that ∑_{i=1}^{g} t_i = k. It heuristically sets t_i = k · IEnt(V_i) / ∑_{j=1}^{g} IEnt(V_j). Then, for each i, RATIO-GREEDY greedily reduces the number of clusters in L_i from |V_i| to t_i by iteratively selecting two adjacent clusters in the current list and replacing them with their union, so that a list with one less cluster is obtained. The pair of adjacent clusters selected to be merged at each iteration is the one for which loss(·, ·) is minimum, where the loss of two clusters C and C′ is given by loss(C, C′) = IEnt(C ∪ C′) − IEnt(C) − IEnt(C′)."
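The merging phase quoted above is simple enough to sketch. The Python below is a minimal illustration, not the authors' implementation: it assumes IEnt is the usual weighted entropy impurity of a cluster's aggregate vector (‖v‖₁ · H(v/‖v‖₁)), uses a naive O(n²) scan for the best adjacent pair, and the helper names (`ient`, `allocate_budgets`, `greedy_merge`) and the rounding of the t_i budgets are our own choices.

```python
import numpy as np

def ient(v):
    """Entropy impurity of an aggregate count vector v, assumed here to be
    ||v||_1 * H(v / ||v||_1); check the paper for the exact definition."""
    total = v.sum()
    if total <= 0:
        return 0.0
    p = v[v > 0] / total
    return -total * float(np.sum(p * np.log2(p)))

def allocate_budgets(groups, k):
    """Heuristic budget t_i = k * IEnt(V_i) / sum_j IEnt(V_j); the paper does
    not spell out the rounding, so leftovers go to the largest fractions."""
    imps = np.array([ient(np.sum(np.asarray(Vi, dtype=float), axis=0)) for Vi in groups])
    raw = k * imps / imps.sum()
    t = np.floor(raw).astype(int)
    for j in np.argsort(raw - t)[::-1][: k - t.sum()]:
        t[j] += 1
    return t

def greedy_merge(sorted_vectors, t):
    """Reduce a ratio-sorted list of unitary clusters to t clusters by
    repeatedly merging the adjacent pair with minimum
    loss(C, C') = IEnt(C u C') - IEnt(C) - IEnt(C')."""
    clusters = [np.asarray(v, dtype=float) for v in sorted_vectors]
    while len(clusters) > t:
        losses = [ient(clusters[j] + clusters[j + 1]) - ient(clusters[j]) - ient(clusters[j + 1])
                  for j in range(len(clusters) - 1)]
        j = int(np.argmin(losses))                   # cheapest adjacent merge
        clusters[j] = clusters[j] + clusters[j + 1]  # union = sum of aggregate vectors
        del clusters[j + 1]
    return clusters
```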
Open Source Code | Yes | "Code and datasets are available in (Murtinho, 2019). URL https://github.com/lmurtinho/RatioGreedyClustering/tree/ICML_submission."
Open Datasets | Yes | "We tested these methods on clustering 51,480 words from the 20NEWS corpus and 170,946 words from the RCV1 corpus, according to their distributions w.r.t. 20 and 103 different classes, respectively. The distribution vectors associated with the words are built according to the methodology employed in (Dhillon et al., 2003) to address text classification tasks." "Code and datasets are available in (Murtinho, 2019). URL https://github.com/lmurtinho/RatioGreedyClustering/tree/ICML_submission."
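The distribution vectors themselves are easy to reconstruct in outline. A minimal sketch, assuming (as in Dhillon et al., 2003) that each word's vector collects its occurrence counts per class, so that normalizing a row estimates p(class | word); the function name and input shapes are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.sparse import csr_matrix

def word_class_vectors(doc_term: csr_matrix, labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate per-document term counts into one vector per word, with one
    component per class; row i normalized estimates p(class | word i)."""
    out = np.zeros((doc_term.shape[1], n_classes))
    for c in range(n_classes):
        # column-wise sum over the documents labeled with class c
        out[:, c] = np.asarray(doc_term[labels == c].sum(axis=0)).ravel()
    return out
```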
Dataset Splits | No | The paper uses the 20NEWS and RCV1 corpora for evaluation but does not specify explicit dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | "When k > g, the initialization of DC resembles ADom, since it consists of splitting the vectors in V_i among k/g clusters. When k ≤ g, we initialize DC using ADom."
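That initialization is straightforward to mirror in code. A sketch under the same assumptions as above, with `adom` as a placeholder for the paper's dominance algorithm (which is not reproduced here):

```python
import numpy as np

def init_dc(dominance_groups, k, adom):
    """DC initialization as described: for k > g, split each dominance group
    V_i evenly among roughly k/g clusters; for k <= g, fall back to ADom."""
    g = len(dominance_groups)
    if k <= g:
        return adom(dominance_groups, k)
    per_group = k // g  # k > g, so at least one cluster per group
    clusters = []
    for Vi in dominance_groups:
        # split the indices of V_i into near-equal contiguous chunks
        clusters.extend(np.array_split(np.arange(len(Vi)), per_group))
    return clusters
```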