VRCA: A Clustering Algorithm for Massive Amount of Texts

Authors: Ming Liu, Lei Chen, Bingquan Liu, Xiaolong Wang

IJCAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The following experiments are partitioned into three parts. The first part discusses how to set the threshold (denoted as maxp) used to choose features to reconstruct a cluster's vector. The second and third parts compare our algorithm with several popular baseline algorithms on time performance and clustering precision to demonstrate our algorithm's high quality. The baseline algorithms include K-means, STING, DBSCAN, BIRCH, GSOM, GHSOM, Spectral Clustering, Non-Regression Matrix Factorization (NRMF), and LDCC. Four testing collections are employed in the experiments: two prevalent testing collections, Reuters (21,578) and Newsgroup (2,000), and two large-scale text collections, TRC2 (1.8 million) and Clue Web9 after removing empty and short texts (100 million).
Researcher Affiliation | Academia | Ming Liu (HIT, China; liuming1981@hit.edu.cn), Lei Chen (BNUZ, China; chenlei@bnuz.edu.cn), Bingquan Liu (HIT, China; liubq@insun.hit.edu.cn), Xiaolong Wang (HIT, China; wangxl@insun.hit.edu.cn)
Pseudocode | Yes | Algorithm Workflow. Input: text set D; maximum neuron (or cluster) number k; current neuron (or cluster) number ck; convergence condition maxth; iterative index t; the number of steps before entering the overall tuning sub-process, maxt; the threshold limiting the selected features, maxp. Output: neuron (or cluster) set N (or C; in our algorithm, one neuron corresponds to one cluster).
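The inputs and outputs above suggest a GSOM-style grow-and-tune loop. The sketch below is a hedged toy reconstruction, not the paper's pseudocode: it uses 1-D values as stand-ins for text vectors, and the tuning and growth rules (nearest-neuron assignment, mean update, growing one neuron every maxt steps) are illustrative assumptions keyed only to the parameter names D, k, ink, maxth, maxt from the workflow description.

```python
import random

def vrca_skeleton(D, k, ink=2, maxth=1e-4, maxt=5):
    """Illustrative skeleton (NOT the paper's exact algorithm): grow
    neurons from ink toward the cap k, tuning until the largest neuron
    movement falls below the convergence condition maxth.
    D is a list of 1-D feature values standing in for text vectors."""
    random.seed(0)
    neurons = random.sample(D, ink)   # ck = ink initial neurons
    t = 0                             # iterative index
    movement = float("inf")
    while movement > maxth:
        t += 1
        # partial tuning: assign each text to its nearest neuron
        clusters = {i: [] for i in range(len(neurons))}
        for x in D:
            i = min(range(len(neurons)), key=lambda j: abs(x - neurons[j]))
            clusters[i].append(x)
        # move each neuron toward the mean of its assigned texts
        new_neurons = [sum(c) / len(c) if c else neurons[i]
                       for i, c in clusters.items()]
        movement = max(abs(a - b) for a, b in zip(neurons, new_neurons))
        neurons = new_neurons
        # every maxt steps, enter an "overall tuning" sub-process;
        # here we simply grow one neuron from the largest cluster
        # while still under the cap k (an assumed growth rule)
        if t % maxt == 0 and len(neurons) < k:
            worst = max(clusters, key=lambda i: len(clusters[i]))
            neurons.append(max(clusters[worst]))
    return neurons
```

The returned list plays the role of the output neuron (cluster) set N; feature selection via maxp is omitted because this sketch has no sparse text vectors to prune.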
Open Source Code | No | The paper does not provide any statement about making the source code for the proposed algorithm publicly available, nor a link to a code repository.
Open Datasets | Yes | Two large-scale text collections, Clue Web9 and TRC2, are adopted to test the performance of our algorithm. Besides, two other popular text collections (Reuters and Newsgroup) are also adopted to prove our algorithm's high performance on small-scale text collections. ... Four testing collections are employed in the experiments: two prevalent testing collections, Reuters (21,578) and Newsgroup (2,000), and two large-scale text collections, TRC2 (1.8 million) and Clue Web9 after removing empty and short texts (100 million).
Dataset Splits | No | The paper mentions using specific datasets for testing but does not describe how they are split into training, validation, or test sets, nor does it specify any cross-validation setup.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper describes the algorithms and models used but does not list any specific software dependencies or version numbers needed for reproducibility.
Experiment Setup | Yes | Parameter Setting: the previous workflow has five initial parameters that must be fixed in advance: k, ink, maxth, maxt, and maxp. k denotes the maximum cluster number; we set it to the square root of n. ... ink denotes the initial neuron number; we set it to 2 without loss of generality. maxth and maxt respectively denote the convergence condition and the number of steps in the partial tuning sub-process; we set them according to the parameter setting in [Alahakoon et al., 2000]. ... maxp denotes the threshold limiting the number of features selected to reconstruct a cluster's vector; we set it to 300 according to the analysis in [Martin et al., 2004; Liu et al., 2014] and our experimental results in Section 4.1 (Figure 1 and Table 1).
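The reported defaults can be collected into one helper for a re-implementation attempt. This is a minimal sketch under stated assumptions: the function name is my own, and maxth and maxt are passed through as arguments rather than guessed, since the paper defers their values to [Alahakoon et al., 2000] without stating them here.

```python
import math

def default_parameters(n, maxth, maxt):
    """Parameter defaults as reported in the paper's setup section:
    k is the square root of the corpus size n, ink is 2, maxp is 300.
    maxth and maxt must be supplied by the caller (the paper takes
    them from Alahakoon et al., 2000, without listing the values)."""
    return {
        "k": int(math.sqrt(n)),  # maximum cluster number
        "ink": 2,                # initial neuron number
        "maxth": maxth,          # convergence condition
        "maxt": maxt,            # steps before overall tuning
        "maxp": 300,             # feature-selection threshold
    }
```

For example, on TRC2 (n = 1.8 million texts) this yields k = 1341 clusters at most, matching the square-root rule.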