Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Coresets for Clustering with Missing Values

Authors: Vladimir Braverman, Shaofeng Jiang, Robert Krauthgamer, Xuan Wu

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We implement our algorithm and validate its performance on various real data sets. Our coreset exhibits a ﬂexible tradeoff between coreset size and accuracy, and generally outperforms the uniformsampling baseline. Furthermore, it signiﬁcantly speeds up a Lloyd s-style heuristic for k-MEANS with missing values. [...] We implement our proposed coreset construction algorithm, and we evaluate its performance on real and synthetic datasets. [...] We run our experiments on three real datasets and one synthetic dataset.
Researcher Affiliation	Academia	Vladimir Braverman Johns Hopkins University EMAIL Shaofeng H.-C. Jiang Peking University EMAIL Robert Krauthgamer Weizmann Institute of Science EMAIL Xuan Wu Johns Hopkins University EMAIL
Pseudocode	No	The paper describes algorithmic concepts like importance sampling and Gonzalez's algorithm but does not present them in structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide any links to source code or explicit statements about its public release. It states "We implement our algorithm" but does not offer access to the code.
Open Datasets	Yes	Russian housing [Rus17] is a dataset on Russian house market. [...] KDDCup 2009 [KDD09] is a dataset on customer relationship prediction. [...] Vertical farming [Sam21] is a dataset about cubes which are used for advanced vertical farming.
Dataset Splits	No	We use a randomly selected collection of centers C that consists of 100 randomly generated k-subset C Rd. Since both the evaluation method and the algorithm has randomness, we run the experiment for T = 103 times with independent random bits and report the average empirical error to make it stable.
Hardware Specification	Yes	We implement the algorithms using C++ 11, on a laptop with Intel i5-8350U CPU and 8GB RAM.
Software Dependencies	No	We implement the algorithms using C++ 11. (While C++11 is a specific language version, the paper does not list other software dependencies or libraries with version numbers, nor does it name a self-contained solver/package with a specific version number.)
Experiment Setup	Yes	In our experiments, we follow a standard practice of ﬁxing coreset size in each experiment. [...] For a ﬁxed size coreset, the family size \|I\| is a parameter that needs to be optimized. [...] The experiments are run on the Russian housing data set where the number of iterations of the modiﬁed Lloyd s is set to T = 5000 and the number of clusters is set to a small value k = 3 so as the heuristic is likely to ﬁnd a local minimum faster.