Coresets for Relational Data and The Applications

Authors: Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, Hu Ding

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of our relational coreset on three popular machine learning problems: the SVM with soft margin (α = O(θ²), β = 0, and z = 1), the k-means clustering (α = O(1/ε), β = ε, and z = 2, where ε can be any small number in (0, 1)), and the logistic regression (α = O(θ²), β = 0, and z = 1). All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory.
Researcher Affiliation | Academia | Jiaxiang Chen (1), Qingyuan Yang (2), Ruomin Huang (1), Hu Ding (2); (1) School of Data Science, (2) School of Computer Science and Technology, University of Science and Technology of China. {czar, yangqingyuan, hrm}@mail.ustc.edu.cn, huding@ustc.edu.cn
Pseudocode | Yes | Algorithm 1 AGGREGATION TREE. Input: a set of relational tables {T1, . . . , Ts} and a parameter k (the coreset size). Output: a weighted point set as the coreset.
Open Source Code | Yes | We release our code at GitHub [28]. [28] GitHub. Source code of our relational coreset approach. https://github.com/cjx-zar/Coresets-for-Relational-Data-and-The-Applications
Open Datasets | Yes | We design four different join queries (Q1–Q4) on three real relational datasets. Q1 and Q2 are designed on a labeled dataset, HOME CREDIT [31], and we use them to solve the SVM and logistic regression problems. Q3 and Q4 are foreign-key joins [5] designed on the unlabeled datasets YELP [58] and FAVORITA [26], respectively, and we use them to solve the k-means clustering problem.
Dataset Splits | No | The paper describes the datasets used (HOME CREDIT, YELP, FAVORITA) and the machine learning problems solved on them, but the main text does not specify how these datasets were split into training, validation, or test sets.
Hardware Specification | Yes | All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory.
Software Dependencies | Yes | Our algorithms were implemented in Python with PostgreSQL 12.10.
Experiment Setup | No | The paper describes the machine learning problems and baselines, and reports end-to-end runtime and optimization quality. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training settings for the models; it only states that "detailed experimental results are in our supplement," deferring training details (e.g., data splits, hyperparameters, and how they were chosen) to the supplement.
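As background for the coreset evaluation described above (an approximation guarantee parameterized by α, β, and z, with z = 2 for k-means), the following is a minimal, generic sketch of a uniform-sampling coreset for the k-means cost. It is an illustrative simplification only, not the paper's Algorithm 1 (AGGREGATION TREE) and not its sensitivity-based sampling; all function names are hypothetical.

```python
import numpy as np

def uniform_coreset(points, m, rng):
    # Uniformly sample m points without replacement and weight each
    # by n/m, so the expected weighted cost equals the full-data cost.
    n = len(points)
    idx = rng.choice(n, size=m, replace=False)
    weights = np.full(m, n / m)
    return points[idx], weights

def kmeans_cost(points, centers, weights=None):
    # k-means cost (z = 2): (weighted) sum over points of the
    # squared distance to the nearest center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    return float(d2 @ weights) if weights is not None else float(d2.sum())

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2))      # stand-in for the joined relation
centers = rng.normal(size=(5, 2))        # an arbitrary candidate solution

core, w = uniform_coreset(data, m=500, rng=rng)
full_cost = kmeans_cost(data, centers)
core_cost = kmeans_cost(core, centers, w)
# core_cost should approximate full_cost for this candidate solution.
```

A true (α, β)-coreset must approximate the cost for every candidate set of centers simultaneously, which is what the paper's construction achieves over a relational join without materializing it; the sketch above only illustrates the single-solution estimate that such samples provide.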