Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Balanced Clustering: A Uniform Model and Fast Algorithm
Authors: Weibo Lin, Zhu He, Mingyu Xiao
IJCAI 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results over benchmarks validate the advantage of our algorithm compared to the state-of-the-art balanced clustering algorithms. On most datasets, our algorithm runs more than 100 times faster than previous algorithms with a better solution. |
| Researcher Affiliation | Collaboration | Weibo Lin¹, Zhu He¹,² and Mingyu Xiao¹. ¹ School of Computer Science, University of Electronic Science and Technology of China. ² Zhejiang Cainiao Supply Chain Management Co. Ltd. |
| Pseudocode | Yes | Algorithm 1 Regularized k-means with warm start |
| Open Source Code | Yes | The codes and data in this paper are publicly available¹. ¹ https://github.com/zhu-he/regularized-k-means |
| Open Datasets | Yes | In the experiments, we consider ten datasets, including six real-world UCI datasets², two artificial datasets s1-s2³ which have Gaussian clusters with increasing overlap and two MNIST datasets⁴ of handwritten digits. ² https://archive.ics.uci.edu/ml/index.php ³ http://cs.uef.fi/sipu/datasets/ ⁴ http://yann.lecun.com/exdb/mnist/ |
| Dataset Splits | No | The paper refers to "MNIST-train" and "MNIST-test" datasets, implying training and testing sets. However, it does not provide specific details about how data was split for training, validation, and testing (e.g., percentages, counts, or a specific validation split methodology). |
| Hardware Specification | Yes | As a platform, Intel Core i7-8700K 3.70 GHz processor with 16GB memory was used. |
| Software Dependencies | No | The paper mentions that "LKM was re-implemented in C++ by ourselves. Our method is also implemented in C++." However, it does not provide specific version numbers for any C++ compiler, libraries, or other software dependencies used in the experiments. |
| Experiment Setup | Yes | Regularization settings. To get a contrastive result between our method and LKM, we set the regularization term in our model the same as that in LKM by letting $f_1(x) = \cdots = f_k(x) = \lambda x^2$ (16), where $\lambda$ is the balance parameter. In order to obtain different balancing performance, different values of $\lambda$ are tested. Specifically, let $Var = \sum_{i=1}^{n} \lVert x_i - \mu \rVert_2^2$, where $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, be the variance of the dataset $\{x_i\}_{i=1}^{n}$; the values of $\lambda$ are uniformly chosen from the interval $[0, \frac{40\,Var}{kn^2}]$. For hard-balanced clustering, we use the following functions to derive a strictly balanced result: $f_h(x) = \begin{cases} M(\lfloor n/k \rfloor - x) & \text{if } x < \lfloor n/k \rfloor \\ M(x - \lceil n/k \rceil) & \text{if } x > \lceil n/k \rceil \\ 0 & \text{otherwise} \end{cases}$ (17), for $h = 1, \ldots, k$, where $M$ is a large real number. In the experiments, it is sufficient to set $M = Var$. |
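
The regularization settings quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not taken from the authors' C++ repository. The function names and the `steps` parameter are hypothetical, "uniformly chosen" is assumed here to mean an evenly spaced grid, and the floor/ceiling bounds in `hard_penalty` are this sketch's reading of Eq. (17).

```python
def dataset_variance(points):
    """Var: sum of squared Euclidean distances from each point to the mean."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(
        sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points
    )


def lambda_grid(points, k, steps):
    """Evenly spaced lambda values on [0, 40 * Var / (k * n^2)].

    Assumption: an evenly spaced grid; `steps` is a hypothetical parameter.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    n = len(points)
    upper = 40.0 * dataset_variance(points) / (k * n * n)
    return [upper * i / (steps - 1) for i in range(steps)]


def soft_penalty(size, lam):
    """Eq. (16): quadratic penalty lambda * x^2 on a cluster of size x."""
    return lam * size ** 2


def hard_penalty(size, n, k, big_m):
    """Eq. (17), as read by this sketch: zero inside [floor(n/k), ceil(n/k)],
    growing linearly with slope M outside that range."""
    lower = n // k          # floor(n / k)
    upper = -(-n // k)      # ceil(n / k) using integer arithmetic
    if size < lower:
        return big_m * (lower - size)
    if size > upper:
        return big_m * (size - upper)
    return 0.0


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    var = dataset_variance(pts)
    print(var, lambda_grid(pts, k=2, steps=3))
    print(hard_penalty(1, n=5, k=2, big_m=var))
```

Setting `big_m` to the dataset variance mirrors the quoted statement that M = Var is sufficient in the experiments.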