Balanced Clustering: A Uniform Model and Fast Algorithm

Authors: Weibo Lin, Zhu He, Mingyu Xiao

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results over benchmarks validate the advantage of our algorithm compared to the state-of-the-art balanced clustering algorithms. On most datasets, our algorithm runs more than 100 times faster than previous algorithms with a better solution."
Researcher Affiliation | Collaboration | Weibo Lin (1), Zhu He (1,2) and Mingyu Xiao (1); (1) School of Computer Science, University of Electronic Science and Technology of China; (2) Zhejiang Cainiao Supply Chain Management Co. Ltd.
Pseudocode | Yes | Algorithm 1: Regularized k-means with warm start
Open Source Code | Yes | "The codes and data in this paper are publicly available." (https://github.com/zhu-he/regularized-k-means)
Open Datasets | Yes | "In the experiments, we consider ten datasets, including six real-world UCI datasets (https://archive.ics.uci.edu/ml/index.php), two artificial datasets s1-s2 (http://cs.uef.fi/sipu/datasets/), which have Gaussian clusters with increasing overlap, and two MNIST datasets (http://yann.lecun.com/exdb/mnist/) of handwritten digits."
Dataset Splits | No | The paper refers to "MNIST-train" and "MNIST-test" datasets, implying separate training and testing sets. However, it does not provide specific details about how the data was split for training, validation, and testing (e.g., percentages, counts, or a specific validation split methodology).
Hardware Specification | Yes | "As a platform, Intel Core i7-8700K 3.70 GHz processor with 16GB memory was used."
Software Dependencies | No | The paper mentions that "LKM was re-implemented in C++ by ourselves. Our method is also implemented in C++." However, it does not provide version numbers for the C++ compiler, libraries, or other software dependencies used in the experiments.
Experiment Setup | Yes | "Regularization settings. To get a contrastive result between our method and LKM, we set the regularization term in our model the same as that in LKM by letting f_1(x) = ... = f_k(x) = λx^2 (Eq. 16), where λ is the balance parameter. In order to obtain different balancing performance, different values of λ are tested. Specifically, letting Var = Σ_{i=1}^n ||x_i − µ||_2^2, where µ = (1/n) Σ_{i=1}^n x_i, be the variance of the dataset {x_i}_{i=1}^n, the values of λ are uniformly chosen from the interval [0, 40·Var/(k·n^2)]. For hard-balanced clustering, we use the following functions to derive a strictly balanced result: f_h(x) = M·(n/k − x) if x < n/k, M·(x − n/k) if x > n/k, and 0 otherwise (Eq. 17), for h = 1, ..., k, where M is a large real number. In the experiments, it is sufficient to set M = Var."
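To make the quoted regularization settings concrete, below is a minimal Python/NumPy sketch of the λ grid and the two regularizers. It is not the authors' released C++ implementation; the helper names (lambda_grid, soft_penalty, hard_penalty) and the num_values parameter are illustrative assumptions.

```python
# Sketch of the regularization settings described above (illustrative only).
import numpy as np

def lambda_grid(X, k, num_values=10):
    """Uniformly spaced balance parameters in [0, 40*Var / (k*n^2)]."""
    n = X.shape[0]
    mu = X.mean(axis=0)                                  # dataset mean
    var = np.sum(np.linalg.norm(X - mu, axis=1) ** 2)    # Var = sum_i ||x_i - mu||^2
    return np.linspace(0.0, 40.0 * var / (k * n ** 2), num_values), var

def soft_penalty(cluster_size, lam):
    """Soft regularizer f_h(x) = lambda * x^2 (Eq. 16)."""
    return lam * cluster_size ** 2

def hard_penalty(cluster_size, n, k, M):
    """Hard-balance regularizer (Eq. 17): penalize sizes away from n/k."""
    target = n / k
    if cluster_size < target:
        return M * (target - cluster_size)
    if cluster_size > target:
        return M * (cluster_size - target)
    return 0.0

# Example usage on random data with k = 5 clusters; M = Var as in the paper.
X = np.random.rand(1000, 2)
lams, var = lambda_grid(X, k=5)
print(lams[:3], hard_penalty(180, n=1000, k=5, M=var))
```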