Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Coresets for Clustering Under Stochastic Noise

Authors: Lingxiao Huang, Zhize Li, Nisheeth K. Vishnoi, Runkai Yang, Haoyu Zhao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results, in Section 4, support our theoretical findings even for datasets that do not meet the assumptions required by our theoretical analysis (see, e.g., Table 1), and in scenarios involving non-i.i.d. noise across dimensions (see, e.g., Table 7). Overall, our algorithm effectively generates small coresets with theoretical quality bounds, which can be integrated into clustering frameworks, enhancing robustness in noisy environments.
Researcher Affiliation | Academia | Lingxiao Huang (Nanjing University), Zhize Li (Singapore Management University), Nisheeth K. Vishnoi (Yale University), Runkai Yang (Nanjing University), Haoyu Zhao (Princeton University)
Pseudocode | Yes | Algorithm 1 A coreset algorithm CNα using the Errα metric under the noise model I
Open Source Code | Yes | https://github.com/xiaohuangyang/Coresets-for-Clustering-Under-Stochastic-Noise
Open Datasets | Yes | We consider the k-MEANS problem on the Adult [61] and Census1990 [68] datasets from the UCI Repository.
Dataset Splits | No | The paper does not explicitly describe training/test/validation dataset splits. It mentions using entire datasets (Adult, Census1990) and varying noise levels and tolerance thresholds for evaluation.
Hardware Specification | Yes | All experiments are conducted using Python 3.11 on an Apple M3 Pro machine with an 11-core CPU, 14-core GPU, and 36 GB of memory.
Software Dependencies | No | The paper mentions 'Python 3.11' but does not specify versions for other key software libraries or dependencies, which is required for a reproducible description.
Experiment Setup | Yes | We set k = 10. We perturb each dataset under noise model I, using Gaussian noise with θ ∈ {0, 0.01, 0.05, 0.25}, where θ = 0 denotes the noise-free case. For varying tolerance levels ε ∈ {0.1, 0.15, 0.2, 0.25, 0.3}, we construct a coreset S from P̂ using CN and CNα. For the initialization of our algorithms, we run k-means++ with max_iter = 5 on P̂ to obtain a fast O(1)-approximate solution.
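
The experiment setup above can be illustrated with a minimal Python sketch. This is not the authors' code. The summary does not define noise model I, so the sketch assumes one plausible reading: each coordinate independently receives zero-mean Gaussian noise with probability θ, which makes θ = 0 the noise-free case. The function names (`perturb`, `kmeans_pp`, `kmeans_cost`), the `sigma` parameter, and the seeds are all hypothetical. The coreset constructions CN and CNα are not reproduced here. Only the standard library is used so the block stays self-contained; a real run on Adult or Census1990 would likely use a vectorized library instead.

```python
import random

# Parameter grids as quoted in the experiment setup.
K = 10
THETAS = (0, 0.01, 0.05, 0.25)
EPSILONS = (0.1, 0.15, 0.2, 0.25, 0.3)


def perturb(points, theta, sigma=1.0, seed=0):
    """ASSUMED reading of noise model I: each coordinate independently gets
    N(0, sigma^2) noise with probability theta; theta = 0 changes nothing."""
    rng = random.Random(seed)
    return [
        [x + rng.gauss(0.0, sigma) if rng.random() < theta else x for x in p]
        for p in points
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_cost(points, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans_pp(points, k, max_iter=5, seed=0):
    """k-means++ seeding followed by at most max_iter Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Sample the next center with probability proportional to D^2.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point already coincides with a center; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers
```

Under these assumptions, one run of the setup would call `perturb(data, theta)` for each θ in `THETAS`, then `kmeans_pp(noisy, K, max_iter=5)` on the perturbed data to obtain the rough initial solution that the coreset construction starts from, repeating the construction for each ε in `EPSILONS`.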