Fast and Simple Spectral Clustering in Theory and Practice

Authors: Peter Macgregor

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy."
Researcher Affiliation | Academia | "Peter Macgregor, School of Informatics, University of Edinburgh, peter.macgregor@ed.ac.uk"
Pseudocode | Yes | "Algorithm 1: POWERMETHOD(M ∈ R^{n×n}, x0 ∈ R^n, t ∈ Z_{≥0}); Algorithm 2: FASTSPECTRALCLUSTER(G = (V, E), k ∈ Z_{≥0}, ϵ ∈ [0, 1])"
Open Source Code | Yes | "The code to reproduce the experiments is available at https://github.com/pmacg/fast-spectral-clustering."
Open Datasets | Yes | "We evaluate the spectral clustering algorithms on synthetic data drawn from the stochastic block model. [...] MNIST [18]: [...] Pen Digits [1]: [...] Fashion [34]: [...] HAR (Human Activity Recognition) [2]: [...] Letter [9]: [...] The datasets are all made available by the Open ML [31] project, and can be downloaded with the scikit-learn library [27]."
Dataset Splits | No | The paper evaluates on synthetic and real-world datasets but specifies no explicit training, validation, or test splits (no percentages, sample counts, or references to predefined splits). For synthetic data it describes generation parameters; for real-world data it describes pre-processing steps but no data partitioning.
Hardware Specification | Yes | "All experiments are performed on an HP laptop with an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz processor and 32 GB RAM."
Software Dependencies | No | "We implement all algorithms in Python, using the numpy [12], scipy [32], stag [22], and scikit-learn [27] libraries for matrix manipulation, eigenvector computation, graph processing, and k-means approximation respectively." While the libraries are named, no version numbers are given for Python or any of these libraries.
Experiment Setup | Yes | "Remark 3.3. The constants in the definition of l and t in Algorithm 2 are based on those in the analysis of Makarychev et al. [23] and this paper. In practice, we find that setting l = log(k) and t = 10 log(n/k) works well. [...] Given parameters n ∈ Z_{≥0}, k ∈ Z_{≥0}, p ∈ [0, 1], and q ∈ [0, 1], we generate a graph G = (V, E) with n vertices and k ground-truth clusters S1, . . . , Sk of size n/k. [...] We first pre-process each dataset by computing the k nearest neighbour graph from the data, for k = 10."
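The signature of Algorithm 1, POWERMETHOD(M, x0, t), matches the classical power iteration. A minimal sketch under that assumption (the normalisation scheme and function body below are a reconstruction from the signature alone, not the paper's actual pseudocode):

```python
import numpy as np

def power_method(M, x0, t):
    """Run t steps of power iteration: x <- M x / ||M x||.

    Converges to the dominant eigenvector of M when the top eigenvalue
    is well separated and x0 is not orthogonal to that eigenvector.
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(t):
        x = M @ x
        x = x / np.linalg.norm(x)
    return x

# Example: the dominant eigenvector of diag(3, 1) is the first basis vector.
M = np.diag([3.0, 1.0])
x = power_method(M, np.array([1.0, 1.0]), t=50)
```

Each iteration multiplies the component along the top eigenvector by the eigenvalue gap ratio, so convergence is geometric; this is why a small t such as 10 log(n/k) can suffice in practice.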
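The synthetic experiments draw graphs from the stochastic block model with parameters n, k, p, q and k equal-size ground-truth clusters. A minimal sampler consistent with that description (the helper name `sample_sbm` and the dense adjacency representation are illustrative choices, not the paper's code):

```python
import numpy as np

def sample_sbm(n, k, p, q, seed=None):
    """Sample an SBM adjacency matrix with k equal-size clusters:
    each within-cluster edge appears with probability p and each
    between-cluster edge with probability q, independently."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)          # ground-truth cluster labels
    same = labels[:, None] == labels[None, :]         # same-cluster mask
    probs = np.where(same, p, q)                      # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample upper triangle only
    A = (upper | upper.T).astype(int)                 # symmetrise; no self-loops
    return A, labels

A, labels = sample_sbm(n=100, k=4, p=0.9, q=0.05, seed=0)

# With p >> q the clusters are visible in the edge densities.
mask = labels[:, None] == labels[None, :]
within = A[mask & ~np.eye(100, dtype=bool)].mean()
between = A[~mask].mean()
```

Sampling only the upper triangle and mirroring it keeps the graph undirected and avoids double-sampling each pair.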
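The real-world pre-processing step, building a k nearest neighbour graph with k = 10, can be sketched with scikit-learn's `kneighbors_graph`. The random placeholder features below stand in for a dataset that would normally be fetched from OpenML via `sklearn.datasets.fetch_openml` (the exact loading code is an assumption; the paper only states the datasets come from OpenML via scikit-learn):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Placeholder features; the paper's datasets (MNIST, Fashion, ...) would
# instead be loaded, e.g. via sklearn.datasets.fetch_openml("mnist_784").
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Sparse 10-nearest-neighbour graph, symmetrised so the graph is undirected
# (an edge exists if either endpoint lists the other among its 10 neighbours).
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
A = A.maximum(A.T)
```

Symmetrising with an element-wise maximum is one common convention; taking the intersection of neighbour lists instead would give a sparser mutual k-NN graph.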