Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Coresets for Clustering Under Stochastic Noise

Authors: Lingxiao Huang, Zhize Li, Nisheeth K. Vishnoi, Runkai Yang, Haoyu Zhao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results, in Section 4, support our theoretical findings even for datasets that do not meet the assumptions required by our theoretical analysis (see, e.g., Table 1), and in scenarios involving non-i.i.d. noise across dimensions (see, e.g., Table 7). Overall, our algorithm effectively generates small coresets with theoretical quality bounds, which can be integrated into clustering frameworks, enhancing robustness in noisy environments.
Researcher Affiliation	Academia	Lingxiao Huang (Nanjing University); Zhize Li (Singapore Management University); Nisheeth K. Vishnoi (Yale University); Runkai Yang (Nanjing University); Haoyu Zhao (Princeton University)
Pseudocode	Yes	Algorithm 1 A coreset algorithm CNα using the Errα metric under the noise model I
Open Source Code	Yes	https://github.com/xiaohuangyang/Coresets-for-Clustering-Under-Stochastic-Noise
Open Datasets	Yes	We consider the k-MEANS problem on the Adult [61] and Census1990 [68] datasets from the UCI Repository.
Dataset Splits	No	The paper does not explicitly describe training/test/validation dataset splits. It mentions using entire datasets (Adult, Census1990) and varying noise levels and tolerance thresholds for evaluation.
Hardware Specification	Yes	All experiments are conducted using Python 3.11 on an Apple M3 Pro machine with an 11-core CPU, 14-core GPU, and 36 GB of memory.
Software Dependencies	No	The paper mentions 'Python 3.11' but does not specify versions for other key software libraries or dependencies, which is required for a reproducible description.
Experiment Setup	Yes	We set k = 10. We perturb each dataset under noise model I, using Gaussian noise with θ ∈ {0, 0.01, 0.05, 0.25}, where θ = 0 denotes the noise-free case. For varying tolerance levels ε ∈ {0.1, 0.15, 0.2, 0.25, 0.3}, we construct a coreset S from P̂ using CN and CNα. For the initialization of our algorithms, we run k-means++ with max_iter = 5 on P̂ to obtain a fast O(1)-approximate solution.
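
The perturbation and initialization steps quoted in the Experiment Setup row can be illustrated with a short, standard-library-only Python sketch. This is not the authors' code. It assumes that noise model I perturbs each point independently with probability θ by adding zero-mean Gaussian noise to every coordinate; the paper's exact definition may differ. The helper names (perturb, kmeans_pp_init, kmeans, kmeans_cost), the unit noise variance, and the synthetic data standing in for Adult and Census1990 are all hypothetical. The coreset constructions CN and CNα are not reproduced here.

```python
import random


def perturb(points, theta, sigma=1.0, seed=0):
    """Return a noisy copy of `points`.

    ASSUMPTION: each point is perturbed independently with probability
    `theta` by adding N(0, sigma^2) noise to every coordinate. The paper's
    exact definition of noise model I may differ.
    """
    rng = random.Random(seed)
    noisy = []
    for p in points:
        if rng.random() < theta:
            noisy.append([x + rng.gauss(0.0, sigma) for x in p])
        else:
            noisy.append(list(p))
    return noisy


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: sample centers proportionally to squared distance."""
    centers = [list(rng.choice(points))]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # All points coincide with a chosen center; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(list(points[idx]))
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


def kmeans(points, k, max_iter=5, seed=0):
    """k-means++ seeding followed by at most `max_iter` Lloyd iterations."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                dim = len(members[0])
                centers[j] = [
                    sum(m[i] for m in members) / len(members) for i in range(dim)
                ]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Synthetic stand-in data (hypothetical; the paper uses Adult and Census1990).
data_rng = random.Random(42)
P = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]

for theta in (0.0, 0.01, 0.05, 0.25):
    P_hat = perturb(P, theta, seed=1)
    init_centers = kmeans(P_hat, k=10, max_iter=5, seed=0)
    print(theta, round(kmeans_cost(P_hat, init_centers), 2))
```

In the paper's pipeline, the centers from this quick initialization would then feed the coreset construction for each tolerance level ε. The loop above only prints the clustering cost on the perturbed data so that the effect of θ can be inspected.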