Efficient Clustering Based On A Unified View Of $K$-means And Ratio-cut

Authors: Shenfei Pei, Feiping Nie, Rong Wang, Xuelong Li

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 12 real-world benchmark and 8 facial datasets validate the advantages of the proposed algorithm compared to the state-of-the-art clustering algorithms. In particular, over 15x and 7x speed-up can be obtained with respect to k-means on the synthetic dataset of 1 million samples and the benchmark dataset (CelebA) of 200k samples, respectively.
Researcher Affiliation | Academia | Shenfei Pei, School of Computer Science and Center for OPTIMAL, Northwestern Polytechnical University, shenfeipei@gmail.com; Feiping Nie, School of Computer Science and Center for OPTIMAL, Northwestern Polytechnical University, feipingnie@gmail.com; Rong Wang, School of Cybersecurity and Center for OPTIMAL, Northwestern Polytechnical University, wangrong07@tsinghua.org.cn; Xuelong Li, School of Computer Science and Center for OPTIMAL, Northwestern Polytechnical University, li@nwpu.edu.cn
Pseudocode | Yes | Algorithm 1: An efficient program for solving problem (21).
Open Source Code | Yes | In particular, over 15x and 7x speed-up can be obtained with respect to k-means on the synthetic dataset of 1 million samples and the benchmark dataset (CelebA) of 200k samples, respectively [GitHub].
Open Datasets | Yes | WebFace [50] and CelebA [23] are two large-scale public datasets available for face recognition and verification problems. CALFW [54] and CPLFW [53] are two variants of LFW aiming at cross-age and cross-pose face recognition, respectively. CACD [5], Adience [15], and FERET [35] are constructed for cross-age face retrieval, age and gender recognition, and facial recognition system evaluation.
Dataset Splits | No | The paper does not explicitly provide details about validation dataset splits (e.g., percentages or sample counts).
Hardware Specification | Yes | Both k-means and our code run on the Arch machine with 3.20 GHz i7-8700 CPU, 32 GB main memory.
Software Dependencies | No | The paper mentions software like 'scikit-learn', 'C++', 'Dlib', and 'EFANNA', but it does not specify exact version numbers for any of these software dependencies.
Experiment Setup | Yes | The number of nearest neighbors k is fixed at 20 for 6 synthetic and 12 middle-scale real world datasets. The k-nearest neighbors graphs are generated by EFANNA [14] with k = 100 for all facial datasets. Every method takes 50 runs. The average results are reported.
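
The experiment setup above (a k-nearest-neighbor graph with k = 20, 50 runs per method, averaged results) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a brute-force exact neighbor search in place of EFANNA (which is an approximate method), random 2-D points in place of the paper's datasets, and a hypothetical placeholder `toy_run` in place of an actual clustering run. The names `knn_graph`, `average_over_runs`, and `toy_run` are all invented for illustration.

```python
import math
import random
import statistics


def knn_graph(points, k):
    """Brute-force k-nearest-neighbor lists (an exact stand-in for EFANNA)."""
    graph = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, excluding itself.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph.append([j for _, j in dists[:k]])
    return graph


def average_over_runs(run_once, n_runs=50):
    """Call run_once(seed) for seeds 0..n_runs-1 and return the mean score."""
    scores = [run_once(seed) for seed in range(n_runs)]
    return statistics.mean(scores)


def toy_run(seed, points, graph):
    """Hypothetical placeholder for one clustering run; returns a score in [0, 1]."""
    rng = random.Random(seed)
    i = rng.randrange(len(points))
    # Fraction of point i's neighbors lying on the same side of x = 0.5 as point i.
    same = sum((points[j][0] < 0.5) == (points[i][0] < 0.5) for j in graph[i])
    return same / len(graph[i])


data_rng = random.Random(0)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]  # assumed toy data
graph = knn_graph(points, k=20)  # k = 20, as reported for the non-facial datasets
mean_score = average_over_runs(lambda s: toy_run(s, points, graph), n_runs=50)
print(f"mean score over 50 runs: {mean_score:.3f}")
```

Seeding each run separately keeps the 50 runs reproducible while still varying the initialization, which is one common way to report an average over repeated runs. The paper does not state how its runs were seeded.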