Coresets for Clustering with Missing Values
Authors: Vladimir Braverman, Shaofeng Jiang, Robert Krauthgamer, Xuan Wu
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement our algorithm and validate its performance on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniformsampling baseline. Furthermore, it significantly speeds up a Lloyd s-style heuristic for k-MEANS with missing values. [...] We implement our proposed coreset construction algorithm, and we evaluate its performance on real and synthetic datasets. [...] We run our experiments on three real datasets and one synthetic dataset. |
| Researcher Affiliation | Academia | Vladimir Braverman Johns Hopkins University vova@cs.jhu.edu Shaofeng H.-C. Jiang Peking University shaofeng.jiang@pku.edu.cn Robert Krauthgamer Weizmann Institute of Science robert.krauthgamer@weizmann.ac.il Xuan Wu Johns Hopkins University wu3412790@gmail.com |
| Pseudocode | No | The paper describes algorithmic concepts like importance sampling and Gonzalez's algorithm but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any links to source code or explicit statements about its public release. It states "We implement our algorithm" but does not offer access to the code. |
| Open Datasets | Yes | Russian housing [Rus17] is a dataset on Russian house market. [...] KDDCup 2009 [KDD09] is a dataset on customer relationship prediction. [...] Vertical farming [Sam21] is a dataset about cubes which are used for advanced vertical farming. |
| Dataset Splits | No | We use a randomly selected collection of centers C that consists of 100 randomly generated k-subset C Rd. Since both the evaluation method and the algorithm has randomness, we run the experiment for T = 103 times with independent random bits and report the average empirical error to make it stable. |
| Hardware Specification | Yes | We implement the algorithms using C++ 11, on a laptop with Intel i5-8350U CPU and 8GB RAM. |
| Software Dependencies | No | We implement the algorithms using C++ 11. (While C++11 is a specific language version, the paper does not list other software dependencies or libraries with version numbers, nor does it name a self-contained solver/package with a specific version number.) |
| Experiment Setup | Yes | In our experiments, we follow a standard practice of fixing coreset size in each experiment. [...] For a fixed size coreset, the family size |I| is a parameter that needs to be optimized. [...] The experiments are run on the Russian housing data set where the number of iterations of the modified Lloyd s is set to T = 5000 and the number of clusters is set to a small value k = 3 so as the heuristic is likely to find a local minimum faster. |