Clustering Semi-Random Mixtures of Gaussians
Authors: Aravindan Vijayaraghavan, Pranjal Awasthi
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd s algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Rutgers University, USA. 2EECS Department, Northwestern University, USA. |
| Pseudocode | Yes | Algorithm 1 Lloyd s Algorithm Input: A be the N d data matrix with rows Ai for i [N]. Use A to compute initial centers µ(1) 0 , µ(2) 0 , . . . µ(k) 0 as detailed in Proposition 3.2. Use these k-centers to seed a series of Lloyd-type iterations i.e., for r = 1, 2, . . .: do Set Zi be the set of points for which the closest center among µ(1) r 1, µ(2) r 1, . . . , µ(k) r 1 is µ(i) r 1. Set µ(i) r 1 |Zi| P Aj Zi Aj. end for |
| Open Source Code | No | The paper does not contain any statements or links indicating that open-source code for the described methodology is provided. |
| Open Datasets | No | The paper is theoretical and focuses on a mathematical model (semi-random GMM). It does not use or provide access information for a specific public dataset for experimental training. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with training, validation, and test dataset splits. |
| Hardware Specification | No | The paper focuses on theoretical analysis and algorithms; it does not describe any empirical experiments that would require hardware specifications. |
| Software Dependencies | No | The paper focuses on theoretical analysis and algorithms; it does not describe any empirical experiments that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or system-level training settings. |