Robust Mixture Learning when Outliers Overwhelm Small Groups
Authors: Daniil Dmitriev, Rares-Darius Buhai, Stefan Tiegel, Alexander Wolters, Gleb Novikov, Amartya Sanyal, David Steurer, Fanny Yang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In synthetic experiments, we implement our meta-algorithm with the LD-ME base learner from [8] and show clear improvements compared to the only prior method with guarantees, while being comparable to or better than popular clustering methods such as k-means and DBSCAN for various attack models. |
| Researcher Affiliation | Academia | 1ETH Zurich 2TU Munich 3Lucerne School of Computer Science and Information Technology 4University of Copenhagen |
| Pseudocode | Yes | Algorithm 1 Outer stage, informal (see Algorithm 6) |
| Open Source Code | Yes | We uploaded a zip archive containing the code, together with instructions on how to reproduce the experiments. |
| Open Datasets | Yes | We consider a mixture of k = 7 well-separated (‖µi − µj‖ ≥ 40) d = 100-dimensional inlier clusters whose subgroup sizes range from 0.3 to 0.02. The experiments are conducted once using a Gaussian distribution and once using a heavy-tailed t-distribution with five degrees of freedom for both inlier and adversarial clusters. |
| Dataset Splits | No | Each parameter setting is executed 100 times to account for stochastic variations in the algorithmic procedures, such as k-means initialization. |
| Hardware Specification | Yes | It utilizes CPU resources of an internal cluster with 128 cores, which results in an execution time of about 5 minutes for a single run of the experiment for one noise model with 10000 samples. |
| Software Dependencies | No | We implement the list-decodable mean estimation base learner in our Inner Stage algorithm (Algorithm 3) based on [8]. |
| Experiment Setup | Yes | The hyper-parameters of our algorithm are tuned beforehand based on the experimental setup. For the comparative algorithms, hyper-parameter searches are conducted within each experiment after initial tuning. For our algorithm, key parameters include the pruning radius γ used in the Outer Stage routine (Algorithm 6) and β used in the Inner Stage (Algorithm 4). In addition, parameters for the LD-ME base learner, such as the cluster concentration threshold, also require careful selection, resulting in a total of 7 parameters. The tuning for these was performed using a grid search comparing about 250 different configurations. |
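
To make the data setup quoted in the Open Datasets row concrete, below is a minimal sketch of how such a synthetic mixture and the k-means/DBSCAN baselines mentioned in the Research Type row could be set up. The specific means, mixture weights, seed, and baseline hyper-parameters are illustrative assumptions, not values taken from the released code.

```python
# Illustrative sketch (not the authors' released code): synthetic mixture of
# k = 7 well-separated clusters in d = 100 dimensions, plus k-means / DBSCAN
# baselines as in the paper's comparison.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
d, k, n = 100, 7, 10_000

# Assumed subgroup sizes ranging from 0.3 down to 0.02 (normalized here;
# in the paper's setup the remaining mass is taken up by adversarial points).
weights = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02])
weights = weights / weights.sum()

# One simple way to enforce pairwise separation of at least 40:
# scaled coordinate directions give ||mu_i - mu_j|| = 40 * sqrt(2).
means = 40.0 * np.eye(d)[:k]

labels = rng.choice(k, size=n, p=weights)
# Gaussian inliers; replace standard_normal with rng.standard_t(df=5, size=(n, d))
# to mimic the heavy-tailed variant.
X = means[labels] + rng.standard_normal((n, d))

# Baseline clustering methods compared against in the experiments
# (eps / min_samples are placeholders, not tuned values).
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=15.0, min_samples=20).fit(X)
print("k-means clusters found:", len(np.unique(km.labels_)))
print("DBSCAN clusters found (excluding noise):", len(set(db.labels_) - {-1}))
```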
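The Experiment Setup row describes tuning 7 parameters via a grid search over roughly 250 configurations. A hedged sketch of that kind of tuning loop follows; the parameter names, grids, and the `run_meta_algorithm` wrapper are placeholders standing in for the released implementation, and in practice the score would be averaged over repeated runs.

```python
# Illustrative grid-search sketch for hyper-parameter tuning; the grids and the
# wrapper below are assumptions, not the paper's exact configuration.
from itertools import product
import random

# Placeholder grids for a few of the 7 tuned parameters.
param_grid = {
    "pruning_radius_gamma": [5.0, 10.0, 20.0],   # Outer Stage (Algorithm 6)
    "beta": [0.1, 0.3, 0.5],                     # Inner Stage (Algorithm 4)
    "ld_me_concentration": [1.0, 2.0, 4.0],      # LD-ME base learner threshold
}

def run_meta_algorithm(**params):
    """Stand-in for a call into the released code; returns a quality score
    (e.g. clustering accuracy averaged over repeated runs)."""
    return random.random()  # placeholder score

best_score, best_params = float("-inf"), None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    score = run_meta_algorithm(**params)
    if score > best_score:
        best_score, best_params = score, params

print("best configuration:", best_params, "score:", best_score)
```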