Convex Optimization Procedure for Clustering: Theoretical Revisit

Authors: Changbo Zhu, Huan Xu, Chenlei Leng, Shuicheng Yan

NeurIPS 2014

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we present a theoretical analysis of SON, a convex optimization procedure for clustering using a sum-of-norms (SON) regularization recently proposed in [8, 10, 11, 17]. In particular, we show that if the samples are drawn from two cubes, each being one cluster, then SON can provably identify the cluster membership provided that the distance between the two cubes is larger than a threshold which (linearly) depends on the size of the cube and the ratio of the numbers of samples in each cluster. To the best of our knowledge, this paper is the first to provide a rigorous analysis to understand why and when SON works. We believe this may provide important insights to develop novel convex optimization based algorithms for clustering. ... We now report some numerical experimental results. The empirical performance of SON has been reported in numerous works [8, 10, 11]. It has been shown that SON outperforms traditional clustering methods like K-means in many situations. As such, we do not reproduce such results. Instead, we conduct experiments to validate our theoretical results.
Researcher Affiliation Academia Changbo Zhu, Department of Electrical and Computer Engineering and Department of Mathematics, National University of Singapore (elezhuc@nus.edu.sg); Huan Xu, Department of Mechanical Engineering, National University of Singapore (mpexuh@nus.edu.sg); Chenlei Leng, Department of Statistics, University of Warwick (c.leng@warwick.ac.uk); Shuicheng Yan, Department of Electrical and Computer Engineering, National University of Singapore (eleyans@nus.edu.sg)
Pseudocode No The paper presents mathematical formulations, derivations, and a proof sketch for Theorem 1, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code No Hocking et al. [8] proposed SON, arguing that it can be seen as a generalization of hierarchical clustering, and presented via numerical simulations several situations in which SON works while K-means and average linkage hierarchical clustering fail. They also developed an R package called clusterpath which can be used to solve Problem (1). The paper mentions an R package developed by other researchers (Hocking et al.) that can solve the problem, but it does not state that the authors of this paper are providing open-source code for their specific work. (A hedged sketch of such a SON solver appears after the table.)
Open Datasets No In the experiments, we focus on the samples chosen from R^2, i.e., p = 2, and use synthetic data to obtain the empirical performance. The paper uses synthetic data, which is generated for the experiment, and does not provide access information (link, DOI, repository name, or formal citation with authors/year) for a publicly available or open dataset.
Dataset Splits No We randomly draw a data matrix A where each row belongs to one of the two cubes, and find numerically the largest distance d_{1,2} between the cubes where the cluster membership is not correctly recovered. ... Repeat and sample m data matrices {A^d_1, A^d_2, ..., A^d_m}. The paper describes a procedure for generating synthetic data and repeating the experiment, but it does not specify explicit train/validation/test dataset splits for model training and evaluation. (A rough sketch of this threshold search appears after the table.)
Hardware Specification No The paper describes numerical experiments to validate its theoretical results but does not provide any specific details about the hardware (e.g., CPU/GPU models, memory) used to run them.
Software Dependencies No Hocking et al. [8] ... also developed an R package called clusterpath which can be used to solve Problem (1). The paper mentions a specific R package used by others, but it does not list the software dependencies or version numbers used for its own experiments or theoretical work.
Experiment Setup Yes 1. Choose two cubes C1 and C2 from the space R^p with sizes s_1 = 2(σ_{11}, σ_{12}, ..., σ_{1p}) and s_2 = 2(σ_{21}, σ_{22}, ..., σ_{2p}), and the distance between C1 and C2 is d. 2. Choose arbitrarily n_1 points from C1 and n_2 points from C2 and form the data matrix A^d of dimension n × p. Repeat and sample m data matrices {A^d_1, A^d_2, ..., A^d_m}. Theorem 1. ... if w_{1,2} < d_{1,2}, then by choosing the parameter α ∈ R such that w_{1,2} < (n/2)α < d_{1,2}, we have the following: 1. SON can correctly determine the cluster membership of A. (A numerical sketch of this setup follows the table.)
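
For illustration, below is a minimal sketch of a sum-of-norms solver of the kind the clusterpath package targets, assuming the standard SON objective min_X (1/2) Σ_i ||x_i - a_i||^2 + α Σ_{i<j} ||x_i - x_j||. The exact form of the paper's Problem (1) (e.g., any pairwise weights or scaling) is not reproduced here, and CVXPY stands in for the authors' tooling.

```python
# Hedged sketch of a sum-of-norms (SON) clustering solver, assuming the standard
# objective from the SON / clusterpath literature:
#     minimize_X  (1/2) * sum_i ||x_i - a_i||^2  +  alpha * sum_{i<j} ||x_i - x_j||_2
# This is an illustrative CVXPY formulation, not the authors' implementation.
import cvxpy as cp
import numpy as np

def solve_son(A, alpha):
    """Return the SON solution X (same shape as A) for regularization parameter alpha."""
    n, p = A.shape
    X = cp.Variable((n, p))
    fidelity = 0.5 * cp.sum_squares(X - A)                     # data-fitting term
    fusion = sum(cp.norm(X[i] - X[j], 2)                        # sum-of-norms fusion penalty
                 for i in range(n) for j in range(i + 1, n))
    cp.Problem(cp.Minimize(fidelity + alpha * fusion)).solve()
    return X.value
```

Rows of the returned X that numerically coincide are read as belonging to the same cluster; this is the membership check used in the next sketch.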
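The two-cube setup quoted in the Experiment Setup row can be sketched as follows for p = 2. The cube half-width σ, the sample counts n_1 and n_2, the merge tolerance, the random seed, and the particular α are illustrative assumptions, and solve_son is the helper from the previous sketch.

```python
# Hedged sketch of the two-cube experiment (p = 2): sample points from two cubes a
# distance d apart, solve SON, and check whether cluster membership is recovered.
import numpy as np

rng = np.random.default_rng(0)

def sample_two_cubes(n1, n2, sigma, d):
    """Draw n1 points from cube C1 and n2 points from cube C2 in R^2, with gap d between them."""
    C1 = rng.uniform(-sigma, sigma, size=(n1, 2))
    C2 = rng.uniform(-sigma, sigma, size=(n2, 2))
    C2[:, 0] += 2.0 * sigma + d                     # shift along the first axis so the cubes are d apart
    A = np.vstack([C1, C2])
    truth = np.array([0] * n1 + [1] * n2)
    return A, truth

def recovered_labels(X, tol=1e-3):
    """Group rows of the SON solution that have numerically merged."""
    n = len(X)
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            if labels[j] < 0 and np.linalg.norm(X[i] - X[j]) < tol:
                labels[j] = next_label
        next_label += 1
    return labels

def membership_recovered(labels, truth):
    """True if the recovered grouping matches the two planted clusters exactly."""
    return (len(np.unique(labels[truth == 0])) == 1
            and len(np.unique(labels[truth == 1])) == 1
            and labels[truth == 0][0] != labels[truth == 1][0])

A, truth = sample_two_cubes(n1=20, n2=10, sigma=0.5, d=3.0)
X = solve_son(A, alpha=0.1)   # ad hoc alpha; the quoted condition of Theorem 1 is w_{1,2} < (n/2)*alpha < d_{1,2}
print("membership recovered:", membership_recovered(recovered_labels(X), truth))
```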
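Finally, the threshold search quoted in the Dataset Splits row (finding the largest distance d_{1,2} at which membership is not correctly recovered) could look roughly like the loop below. The grid over d and the rule α = 1.8·d/n, chosen to keep (n/2)α just below d as in the quoted Theorem 1 condition, are assumptions rather than the paper's protocol; the paper repeats the draw over m data matrices, which is omitted here for brevity.

```python
# Rough sketch of the empirical threshold search: scan over the cube distance d and
# record the largest d at which cluster membership is NOT correctly recovered.
# Reuses sample_two_cubes, solve_son, recovered_labels, membership_recovered from above.
import numpy as np

n1, n2, sigma = 20, 10, 0.5
largest_failing_d = None
for d in np.linspace(0.2, 4.0, 20):
    A, truth = sample_two_cubes(n1, n2, sigma, d)
    alpha = 1.8 * d / (n1 + n2)                     # keeps (n/2)*alpha slightly below d (assumed rule)
    X = solve_son(A, alpha)
    if not membership_recovered(recovered_labels(X), truth):
        largest_failing_d = d
print("largest d with incorrect recovery:", largest_failing_d)
```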