Coresets for Clustering with Missing Values

Authors: Vladimir Braverman, Shaofeng Jiang, Robert Krauthgamer, Xuan Wu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We implement our algorithm and validate its performance on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniformsampling baseline. Furthermore, it significantly speeds up a Lloyd s-style heuristic for k-MEANS with missing values. [...] We implement our proposed coreset construction algorithm, and we evaluate its performance on real and synthetic datasets. [...] We run our experiments on three real datasets and one synthetic dataset.
Researcher Affiliation Academia Vladimir Braverman Johns Hopkins University vova@cs.jhu.edu Shaofeng H.-C. Jiang Peking University shaofeng.jiang@pku.edu.cn Robert Krauthgamer Weizmann Institute of Science robert.krauthgamer@weizmann.ac.il Xuan Wu Johns Hopkins University wu3412790@gmail.com
Pseudocode No The paper describes algorithmic concepts like importance sampling and Gonzalez's algorithm but does not present them in structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any links to source code or explicit statements about its public release. It states "We implement our algorithm" but does not offer access to the code.
Open Datasets Yes Russian housing [Rus17] is a dataset on Russian house market. [...] KDDCup 2009 [KDD09] is a dataset on customer relationship prediction. [...] Vertical farming [Sam21] is a dataset about cubes which are used for advanced vertical farming.
Dataset Splits No We use a randomly selected collection of centers C that consists of 100 randomly generated k-subset C Rd. Since both the evaluation method and the algorithm has randomness, we run the experiment for T = 103 times with independent random bits and report the average empirical error to make it stable.
Hardware Specification Yes We implement the algorithms using C++ 11, on a laptop with Intel i5-8350U CPU and 8GB RAM.
Software Dependencies No We implement the algorithms using C++ 11. (While C++11 is a specific language version, the paper does not list other software dependencies or libraries with version numbers, nor does it name a self-contained solver/package with a specific version number.)
Experiment Setup Yes In our experiments, we follow a standard practice of fixing coreset size in each experiment. [...] For a fixed size coreset, the family size |I| is a parameter that needs to be optimized. [...] The experiments are run on the Russian housing data set where the number of iterations of the modified Lloyd s is set to T = 5000 and the number of clusters is set to a small value k = 3 so as the heuristic is likely to find a local minimum faster.