Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Online Coresets for Parametric and Non-Parametric Bregman Clustering
Authors: Supratim Shit, Anirban Dasgupta, Rachit Chhaya, Jayesh Choudhari
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also present experiments to compare the performance of our algorithms with other sampling techniques. (Page 1) We present experimental results and compare the performance of our algorithms with other known coreset building techniques. The comparison is done on real-world datasets to support our theoretical claims. (Page 3) Finally we present some experimental results in section 7 on real datasets. |
| Researcher Affiliation | Collaboration | Rachit Chhaya, DA-IICT Gandhinagar, India; Jayesh Choudhari, CUBE, England; Anirban Dasgupta, IIT Gandhinagar, India; Supratim Shit, Technion, Israel |
| Pseudocode | Yes | Algorithm 1 Bregman Filter (Page 7), Algorithm 2 Non Parametric Filter (Page 19). |
| Open Source Code | No | No explicit statement about code release or a link to a repository was found in the paper. |
| Open Datasets | Yes | We compare the performance on the following datasets: 1) KDD (BIO-TRAIN): 145,751 points with 74 features... 2) MNIST: 60,000 points in 784 dimensions (digits dataset)... 3) SONGS: 515,345 songs from the Million Song dataset with 90 features. (Page 22, Page 37) |
| Dataset Splits | No | The paper describes sampling coresets from the full dataset and evaluating performance against the full dataset, but does not provide specific train/test/validation dataset splits. "Using each of the above described algorithm, we first subsample coresets of different sizes...We then use these centers and compute the quantization error (Cs) on the full data set." (Page 23) |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Using each of the above described algorithms, we first subsample coresets of different sizes. Once we have the coreset, we run the weighted k-means++ Arthur & Vassilvitskii (2007) on them to obtain the centers. We then use these centers and compute the quantization error (Cs) on the full data set. We also compute quantization error by running k-means++ on the full data set (Cf). Finally we report the Relative-Error η = \|Cs − Cf\|/Cf. (Page 23) In figure 1 the Y-axis represents the relative error η and the X-axis represents the expected sample size, which is in terms of percentage of the full data. For every expected sample size, we run 10 random instances... For each of the algorithms and for each value of ε we run 5 random instances, compute η = \|Cs − Cf\|/Cf and report the average η value. We consider ε = {1.0, 0.75, 0.5, 0.25}, for which we have {500, 850, 1650, 5500} expected samples. (Page 23-24) |
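The evaluation protocol quoted in the Experiment Setup row can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: a uniform subsample stands in for the paper's coreset construction, and `weighted_kmeans` is a hypothetical helper implementing k-means++ seeding plus weighted Lloyd iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantization_error(X, centers):
    # sum over points of the squared distance to the nearest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def weighted_kmeans(X, w, k, iters=20):
    # k-means++ style seeding: pick centers with probability
    # proportional to weight times squared distance to chosen centers
    centers = X[rng.choice(len(X), size=1)].copy()
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
        p = w * d2
        centers = np.vstack([centers, X[rng.choice(len(X), p=p / p.sum())]])
    # weighted Lloyd updates
    for _ in range(iters):
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers

# toy data: three well-separated blobs (a stand-in for KDD/MNIST/SONGS)
X = np.vstack([rng.normal(c, 0.1, size=(200, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# uniform coreset of 60 points; each carries weight n/|S| so that
# weighted costs on S estimate costs on the full data
idx = rng.choice(len(X), size=60, replace=False)
S, w = X[idx], np.full(60, len(X) / 60.0)

C_s = quantization_error(X, weighted_kmeans(S, w, k=3))          # centers from coreset
C_f = quantization_error(X, weighted_kmeans(X, np.ones(len(X)), k=3))  # centers from full data
eta = abs(C_s - C_f) / C_f   # the paper's Relative-Error η
```

On well-separated data a good coreset should give η close to zero; the paper reports η against expected sample size for each sampling scheme.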