Robust Mixture Learning when Outliers Overwhelm Small Groups

Authors: Daniil Dmitriev, Rares-Darius Buhai, Stefan Tiegel, Alexander Wolters, Gleb Novikov, Amartya Sanyal, David Steurer, Fanny Yang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In synthetic experiments, we implement our meta-algorithm with the LD-ME base learner from [8] and show clear improvements compared to the only prior method with guarantees, while being comparable or better than popular clustering methods such as k-means and DBSCAN for various attack models.
Researcher Affiliation | Academia | ETH Zurich, TU Munich, Lucerne School of Computer Science and Information Technology, University of Copenhagen
Pseudocode | Yes | Algorithm 1: Outer stage, informal (see Algorithm 6)
Open Source Code | Yes | We uploaded a zip archive containing the code together with instructions on how to reproduce the experiments.
Open Datasets | Yes | We consider a mixture of k = 7 well-separated (‖µi − µj‖ ≥ 40) d = 100 dimensional inlier clusters whose subgroup sizes range from 0.3 to 0.02. The experiments are conducted once using a Gaussian distribution and once using a heavy-tailed t-distribution with five degrees of freedom for both inlier and adversarial clusters.
Dataset Splits | No | Each parameter setting is executed 100 times to account for stochastic variations in the algorithmic procedures, such as k-means initialization.
Hardware Specification | Yes | It utilizes CPU resources of an internal cluster with 128 cores, which results in an execution time of about 5 minutes for a single run of the experiment for one noise model with 10000 samples.
Software Dependencies | No | We implement the list-decodable mean estimation base learner in our Inner Stage algorithm (Algorithm 3) based on [8].
Experiment Setup | Yes | The hyper-parameters of our algorithm are tuned beforehand based on the experimental setup. For the comparative algorithms, hyper-parameter searches are conducted within each experiment after initial tuning. For our algorithm, key parameters include the pruning radius γ used in the Outer Stage routine (Algorithm 6) and β used in the Inner Stage (Algorithm 4). In addition, parameters for the LD-ME base learner, such as the cluster concentration threshold, also require careful selection, resulting in a total of 7 parameters. The tuning for these was performed using a grid search comparing about 250 different configurations.
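
The synthetic setup quoted under Open Datasets (k = 7 well-separated clusters in d = 100, subgroup sizes from 0.3 to 0.02, Gaussian or heavy-tailed t noise with 5 degrees of freedom) can be sketched as follows, assuming NumPy and SciPy. The specific mean placement, intermediate cluster weights, and adversarial-point placement are illustrative assumptions, not the authors' exact generator.

```python
import numpy as np
from scipy import stats

def sample_mixture(n=10_000, k=7, d=100, heavy_tailed=False, seed=0):
    """Sketch of the inlier mixture plus one possible adversarial attack model."""
    rng = np.random.default_rng(seed)

    # Place means far apart (pairwise distance 40*sqrt(2) >= 40) on scaled basis vectors.
    means = 40.0 * np.eye(k, d)

    # Subgroup weights from 0.3 down to 0.02 (intermediate values assumed);
    # the remaining mass is treated as adversarial points.
    weights = np.array([0.30, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02])
    n_outliers = n - int(weights.sum() * n)

    X, y = [], []
    for i, w in enumerate(weights):
        m = int(w * n)
        if heavy_tailed:
            noise = stats.t.rvs(df=5, size=(m, d), random_state=rng)
        else:
            noise = rng.standard_normal((m, d))
        X.append(means[i] + noise)
        y.append(np.full(m, i))

    # Adversarial points: here simply a single far-away spurious cluster
    # (only one of the attack models considered in the paper).
    X.append(200.0 + rng.standard_normal((n_outliers, d)))
    y.append(np.full(n_outliers, -1))
    return np.vstack(X), np.concatenate(y)
```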
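The comparison against k-means and DBSCAN, repeated 100 times per parameter setting to average out stochastic effects such as k-means initialization, could look roughly like the sketch below, assuming scikit-learn. The error metric and the DBSCAN hyper-parameter values are placeholders, not the paper's choices.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def mean_estimation_error(X, labels, true_means):
    # For each true mean, distance to the closest estimated cluster center
    # (DBSCAN noise points, labelled -1, are ignored).
    clusters = [c for c in np.unique(labels) if c != -1]
    if not clusters:
        return float("inf")
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    return float(np.mean([np.linalg.norm(centers - m, axis=1).min() for m in true_means]))

def run_baselines(sample_fn, true_means, n_repeats=100):
    """Repeat each baseline n_repeats times and report mean/std of the error."""
    errors = {"kmeans": [], "dbscan": []}
    for rep in range(n_repeats):
        X, _ = sample_fn(seed=rep)
        km = KMeans(n_clusters=7, n_init=10).fit(X)
        errors["kmeans"].append(mean_estimation_error(X, km.labels_, true_means))
        db = DBSCAN(eps=15.0, min_samples=20).fit(X)  # eps/min_samples purely illustrative
        errors["dbscan"].append(mean_estimation_error(X, db.labels_, true_means))
    return {name: (np.mean(v), np.std(v)) for name, v in errors.items()}
```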
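The grid search over the meta-algorithm's 7 hyper-parameters (about 250 configurations, per the Experiment Setup row) amounts to exhaustive enumeration of a small parameter grid. A minimal sketch follows; only the parameter names (pruning radius γ, inner-stage β, LD-ME concentration threshold) come from the description above, and the grid values are illustrative stand-ins.

```python
from itertools import product

def grid_search(evaluate, grids):
    """Exhaustively evaluate every configuration; evaluate(config) -> error."""
    best_cfg, best_err = None, float("inf")
    keys = list(grids)
    for values in product(*(grids[k] for k in keys)):
        cfg = dict(zip(keys, values))
        err = evaluate(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

# Illustrative grid; the remaining parameters of the 7 would be added analogously.
example_grids = {
    "pruning_radius_gamma": [5.0, 10.0, 20.0],
    "inner_stage_beta": [0.1, 0.2, 0.5],
    "ldme_concentration_threshold": [1.0, 2.0, 4.0],
}
```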