AutoCoreset: An Automatic Practical Coreset Construction Framework

Authors: Alaa Maalouf, Murad Tukan, Vladimir Braverman, Daniela Rus

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experimental Study. In what follows, we first discuss the choices of the different vector summarization coresets and the parameters used in our experiments, followed by an evaluation of our coreset on real-world datasets against other well-known competing methods: Near Convex Coreset (Tukan et al., 2020), Lewis weights (Munteanu et al., 2018), and leverage scores (Munteanu et al., 2018) for logistic regression; Near Convex Coreset (Tukan et al., 2020) and an optimization-based coreset (Tukan et al., 2021) for support vector machines (SVM); an SVD-based coreset (Maalouf et al., 2020) for linear regression; a bi-criteria coreset (Braverman et al., 2021) for k-means; and uniform sampling in all of the experiments. We note that each experiment was conducted for 16 trials; we report both the mean and std for all of the presented metrics.
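The evaluation protocol above (16 trials per experiment, mean and standard deviation reported) amounts to a simple aggregation; `summarize_trials` below is an illustrative helper, not code from the paper:

```python
import numpy as np

def summarize_trials(per_trial_metric):
    """Aggregate one metric over repeated trials, as in the protocol above:
    each experiment is run for 16 trials and mean +/- std is reported."""
    vals = np.asarray(per_trial_metric, dtype=float)  # shape: (n_trials,)
    return vals.mean(), vals.std()
```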
Researcher Affiliation | Collaboration | (1) CSAIL, MIT, Cambridge, USA; (2) DataHeroes, Israel; (3) Department of Computer Science, Rice University.
Pseudocode | Yes | Algorithm 1 AUTOCORESET(P, f, τ, m, ζ). Input: a set of n points P, a loss function f, a coreset size τ, a number of initial models m, and a stopping criterion ζ. Output: a coreset (I, v) such that [...]
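The pseudocode row above can be sketched in Python. This is a hypothetical reconstruction from the inputs and the column-sum bookkeeping described elsewhere in this report, not the authors' implementation: the sampling step is a stand-in for the paper's vector-summarization coreset construction, and all names (`auto_coreset`, `train`, `f`) are illustrative.

```python
import numpy as np

def auto_coreset(P, y, f, train, tau, m, patience=7, max_iters=100, seed=0):
    """Hypothetical sketch of the AUTOCORESET loop (illustrative only).
    P: (n, d) data, y: targets, f(model, P, y): per-point loss vector,
    train(P, y, w): fits a model on weighted data, tau: coreset size,
    m: number of initial models."""
    n = len(P)
    rng = np.random.default_rng(seed)
    # Loss matrix M: one column of per-point losses per trained model.
    M = np.column_stack([f(train(P, y, np.ones(n)), P, y) for _ in range(m)])
    best_sum, best, counter = np.inf, None, 0
    for _ in range(max_iters):               # safety cap on iterations
        # Stand-in for the vector-summarization coreset of M's rows:
        # importance sampling proportional to row norms.
        p = np.linalg.norm(M, axis=1)
        p = p / p.sum()
        I = rng.choice(n, size=tau, replace=True, p=p)
        v = 1.0 / (tau * p[I])               # unbiased sampling weights
        model = train(P[I], y[I], v)         # train on the coreset
        col = f(model, P, y)                 # new query -> new loss column
        M = np.column_stack([M, col])
        if col.sum() < best_sum:             # improvement: keep this coreset
            best_sum, best, counter = col.sum(), (I, v), 0
        else:
            counter += 1                     # no improvement seen
        if counter >= patience:              # early stopping, as in the
            break                            # patience mechanism below
    return best                              # the "optimal" coreset (I, v)
```

Any concrete vector-summarization construction could be plugged in at the sampling step; the loop structure (train on coreset, append a new loss column, track the best column sum) is what the pseudocode row describes.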
Open Source Code | Yes | (iii) AutoCoreset: an open-source implementation of our algorithm for reproducing our results and for future research. For simplicity and ease of access, to obtain a coreset the user only needs to plug their desired loss function and data into our system. We believe this system will popularize and expose the use of coresets to other scientific fields.
Open Datasets | Yes | Datasets. The following datasets, taken from (Dua & Graff, 2017) and (Chang & Lin, 2011), were used throughout our experimentation: (i) Credit Card dataset (Yeh & Lien, 2009): 30,000 points with 24 features representing customers' default payments in Taiwan; (ii) Cod-RNA dataset (Uzilov et al., 2006): 59,535 points with 8 features; (iii) HTRU dataset (Lyon et al., 2016): pulsar candidates collected during the HTRU survey, 17,898 points each with 9 features; (iv) 3D Road Network (Guo et al., 2012): a 3D road network with highly accurate elevation information from Denmark, 434,874 points each with 4 features; (v) Accelerometer dataset (Sampaio et al., 2019): accelerometer data from vibrations of a cooler fan with weights on its blades, 153,000 points each with 5 features; and (vi) Energy Efficiency dataset (Tsanas & Xifara, 2012): 768 points each with 8 features.
Dataset Splits | No | The paper mentions "test accuracy" and how sample sizes are split for unbalanced datasets, but it does not provide explicit details about train/validation/test splits or cross-validation methodology for the datasets used.
Hardware Specification | Yes | Tests were performed on a 2.59GHz i7-6500U machine (2 cores total) with 16GB RAM.
Software Dependencies | No | Our algorithms were implemented in Python 3.9 (Van Rossum & Drake, 2009) using Numpy (Oliphant, 2006), Scipy (Virtanen et al., 2020), and Scikit-learn (Pedregosa et al., 2011). While Python 3.9 is specified, specific version numbers for Numpy, Scipy, and Scikit-learn are not provided, only their general citations.
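Since the row above flags the missing version numbers, one lightweight way to record the installed versions of these packages at run time (standard library only, Python 3.8+; `installed_versions` is an illustrative helper, not from the paper) is:

```python
import importlib.metadata as md

def installed_versions(packages=("numpy", "scipy", "scikit-learn")):
    """Map each package name to its installed version, or None if absent.
    The default list mirrors the dependencies cited in the row above."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions
```

Logging this mapping alongside experimental results makes a run reproducible even when the paper itself omits version pins.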
Experiment Setup | Yes | Setting the number of initial solutions m. Throughout our experiments, we set the number of initial solutions to 10.

Choosing a stopping criterion ζ. Inspired by the early-stopping mechanism of (Prechelt, 1998), we adopt a similar idea. We use a parameter, namely patience, which was set to 7, to indicate the occurrence of saturation with respect to the exposure of our coreset paradigm to new queries; see Section A for more details. To use this parameter correctly, we use two additional parameters: one is a counter, while the other holds the optimal coreset, i.e., the one that resulted in the smallest sum of the entries of the concatenated columns (see Line 13 of Algorithm 1). The counter is reset to 0 once a new column is added whose sum is lower than the smallest sum so far, in which case the optimal coreset is also updated; otherwise, the counter is increased. AUTOCORESET keeps running until this counter reaches the patience parameter. In our experiments, we returned the optimal coreset since it led to better results. For completeness, we refer the reader to the appendix, where we conduct an ablation study and check our results without taking the optimal coreset, i.e., taking the last coreset instead. Note that in both sets of experiments we outperform the competing methods.

Logistic regression. We set the maximal number of iterations to 1000 (for the Scikit-Learn solver) and the regularization parameter to 1. Our system's approximation error was smaller by orders of magnitude, and the accuracy of models trained using our coreset was better than that of models trained using the competing methods; see Figure 3(a) and Figure 5(a). Figure 4(a) depicts a multiplicative gap of 30 in approximation error compared to the competing methods, while simultaneously exceeding them by a 5% accuracy gap. In addition, we present the confusion matrix for each of our coresets using AutoCoreset, and compare it to the confusion matrices with respect to the entire data and the uniform-sampling coreset; see Figure 7. The confusion matrices help explain our advantage: our system outputs coresets that approximately preserve the structural properties of the confusion matrix of the entire data better than uniform sampling does, as our recall and accuracy are closer to their corresponding values on the entire dataset.

SVMs. For SVMs, we mainly focused on the linear kernel, setting the regularization parameter to 1. As with logistic regression, we outperform the competing methods in both accuracy and approximation error; see Figure 3(b) and Figure 5(b).
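The confusion-matrix comparison described above comes down to per-class counts; a minimal binary-classification sketch (hypothetical helpers, not the paper's code) of the matrix and the recall value being compared:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix for labels in {0, 1}:
    rows index the true class, columns the predicted class."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def recall(cm):
    # TP / (TP + FN) for the positive class (label 1).
    denom = cm[1, 0] + cm[1, 1]
    return cm[1, 1] / denom if denom else 0.0
```

Comparing the recall computed from predictions of the full-data model against that of a coreset-trained model quantifies how well the coreset preserves the structure of the full-data confusion matrix.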