Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Balanced Clustering: A Uniform Model and Fast Algorithm

Authors: Weibo Lin, Zhu He, Mingyu Xiao

IJCAI 2019 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results over benchmarks validate the advantage of our algorithm compared to the state-of-the-art balanced clustering algorithms. On most datasets, our algorithm runs more than 100 times faster than previous algorithms with a better solution.
Researcher Affiliation Collaboration Weibo Lin1 , Zhu He1,2 and Mingyu Xiao1 1School of Computer Science, University of Electronic Science and Technology of China 2Zhejiang Cainiao Supply Chain Management Co. Ltd.
Pseudocode Yes Algorithm 1 Regularized k-means with warm start
Open Source Code Yes The codes and data in this paper are publicly available1. 1https://github.com/zhu-he/regularized-k-means
Open Datasets Yes In the experiments, we consider ten datasets, including six real-world UCI datasets2, two artificial datasets s1-s23 which have Gaussian clusters with increasing overlap and two MNIST datasets4 of handwritten digits. 2https://archive.ics.uci.edu/ml/index.php 3http://cs.uef.fi/sipu/datasets/ 4http://yann.lecun.com/exdb/mnist/
Dataset Splits No The paper refers to "MNIST-train" and "MNIST-test" datasets, implying training and testing sets. However, it does not provide specific details about how data was split for training, validation, and testing (e.g., percentages, counts, or a specific validation split methodology).
Hardware Specification Yes As a platform, Intel Core i7-8700K 3.70 GHz processor with 16GB memory was used.
Software Dependencies No The paper mentions that "LKM was re-implemented in C++ by ourselves. Our method is also implemented in C++." However, it does not provide specific version numbers for any C++ compiler, libraries, or other software dependencies used in the experiments.
Experiment Setup Yes Regularization settings. To get a contrastive result between our method and LKM, we set the regularization term in our model the same as that in LKM by letting f1(x) = = fk(x) = λ x2, (16) where λ is the balance parameter. In order to obtain different balancing performance, different values of λ are tested. Specifically, let V ar = Pn i=1 ||xi µ||2 2 where µ = Pn i=1 xi n be the variance of the dataset {xi}n i=1, the values of λ are uniformly chosen from the interval [0, 40V ar kn2 ]. For hardbalanced clustering, we use the following functions to derive a strictly balanced result: M x if x < n k x) if x > n k otherwise, (17) for h = 1, . . . , k, where M is a large real number. In the experiments, it is sufficient to set M = V ar.