Coresets for Relational Data and The Applications
Authors: Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, Hu Ding
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our relational coreset on three popular machine learning problems, the SVM with soft margin (α = O(θ²), β = 0, and z = 1), the k-means clustering (α = O(1/ϵ), β = ϵ, and z = 2, where ϵ can be any small number in (0, 1)), and the logistic regression (α = O(θ²), β = 0, and z = 1). All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory. *(See the weighted-training sketch after the table.)* |
| Researcher Affiliation | Academia | Jiaxiang Chen¹, Qingyuan Yang², Ruomin Huang¹, Hu Ding²; ¹School of Data Science, ²School of Computer Science and Technology, University of Science and Technology of China. {czar, yangqingyuan, hrm}@mail.ustc.edu.cn, huding@ustc.edu.cn |
| Pseudocode | Yes | Algorithm 1 AGGREGATION TREE. Input: A set of relational tables {T1, ..., Ts} and a parameter k (the coreset size). Output: A weighted point set as the coreset. |
| Open Source Code | Yes | We release our codes at Github [28]. [28] Github. Source code of our relational coreset approach. https://github.com/cjx-zar/Coresets-for-Relational-Data-and-The-Applications. |
| Open Datasets | Yes | We design four different join queries (Q1-Q4) on three real relational data sets. Q1 and Q2 are designed on a labeled data set HOME CREDIT [31], and we use them to solve the SVM and logistic regression problems. Q3 and Q4 are foreign key joins [5] designed on the unlabeled data sets YELP [58] and FAVORITA [26] respectively, and we use them to solve the k-means clustering problem. |
| Dataset Splits | No | The paper describes the datasets used (HOME CREDIT, YELP, FAVORITA) and the machine learning problems solved on them, but it does not provide specific details on how these datasets were split into training, validation, or test sets within the main text. |
| Hardware Specification | Yes | All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory. |
| Software Dependencies | Yes | Our algorithms were implemented in Python with PostgreSQL 12.10. *(See the PostgreSQL sketch after the table.)* |
| Experiment Setup | No | The paper describes the machine learning problems and baselines used, and notes the end-to-end runtime and optimization quality metrics. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training settings for the models in the main text. It states that "detailed experimental results are in our supplement" and that "training details (e.g., data splits, hyperparameters, how they were chosen)" are provided there rather than in the paper itself. |
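
The quoted experiments train an SVM with soft margin, k-means clustering, and logistic regression on a weighted coreset. As a point of reference only, and not the authors' released code, the sketch below shows how such a weighted point set can be consumed by standard learners via scikit-learn's `sample_weight` argument; the arrays `X_core`, `y_core`, and `w_core` are random stand-ins rather than the output of the paper's AGGREGATION TREE algorithm.

```python
# Illustrative only: random stand-ins for a weighted relational coreset.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_core = rng.normal(size=(200, 10))       # hypothetical coreset points
y_core = rng.integers(0, 2, size=200)     # hypothetical labels (SVM / logistic regression)
w_core = rng.uniform(0.5, 5.0, size=200)  # hypothetical coreset weights

# SVM with soft margin, trained with per-point weights.
svm = LinearSVC(C=1.0, max_iter=5000).fit(X_core, y_core, sample_weight=w_core)

# Logistic regression, trained with per-point weights.
logreg = LogisticRegression(max_iter=1000).fit(X_core, y_core, sample_weight=w_core)

# k-means clustering, with the weights entering the objective.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_core, sample_weight=w_core)

print(svm.coef_.shape, logreg.coef_.shape, kmeans.cluster_centers_.shape)
```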
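
The implementation is reported as Python on top of PostgreSQL 12.10. A minimal sketch of that pipeline shape is given below, assuming the `psycopg2` driver; the database name, table names, and columns are placeholders, not the paper's actual Q1-Q4 schema on HOME CREDIT, YELP, or FAVORITA.

```python
# Placeholder schema: not the paper's actual join queries.
import numpy as np
import psycopg2

conn = psycopg2.connect(dbname="relational_db", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT a.feature_1, a.feature_2, b.feature_3
        FROM table_a AS a
        JOIN table_b AS b ON a.fk = b.pk
        """
    )
    rows = cur.fetchall()
conn.close()

X = np.asarray(rows, dtype=float)  # design matrix over the materialized join result
print(X.shape)
```

Note that materializing the full join like this is exactly the expensive baseline a relational coreset is meant to avoid; the sketch only illustrates the Python-to-PostgreSQL plumbing, not the coreset construction itself.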