Coresets for Relational Data and The Applications

Authors: Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, Hu Ding

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of our relational coreset on three popular machine learning problems: the SVM with soft margin (α = O(θ²), β = 0, and z = 1), the k-means clustering (α = O(1/ε), β = ε, and z = 2, where ε can be any small number in (0, 1)), and the logistic regression (α = O(θ²), β = 0, and z = 1). All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory.
Researcher Affiliation | Academia | Jiaxiang Chen (1), Qingyuan Yang (2), Ruomin Huang (1), Hu Ding (2); (1) School of Data Science, (2) School of Computer Science and Technology, University of Science and Technology of China. {czar, yangqingyuan, hrm}@mail.ustc.edu.cn, huding@ustc.edu.cn
Pseudocode | Yes | Algorithm 1 AGGREGATION TREE. Input: a set of relational tables {T1, . . . , Ts} and a parameter k (the coreset size). Output: a weighted point set as the coreset.
Open Source Code | Yes | We release our code at GitHub [28]. [28] GitHub. Source code of our relational coreset approach. https://github.com/cjx-zar/Coresets-for-Relational-Data-and-The-Applications
Open Datasets | Yes | We design four different join queries (Q1–Q4) on three real relational datasets. Q1 and Q2 are designed on a labeled dataset, HOME CREDIT [31], and we use them to solve the SVM and logistic regression problems. Q3 and Q4 are foreign-key joins [5] designed on the unlabeled datasets YELP [58] and FAVORITA [26], respectively, and we use them to solve the k-means clustering problem.
Dataset Splits | No | The paper describes the datasets used (HOME CREDIT, YELP, FAVORITA) and the machine learning problems solved on them, but the main text does not specify how these datasets were split into training, validation, or test sets.
Hardware Specification | Yes | All the experimental results were obtained on a server equipped with 3.0GHz Intel CPUs and 384GB main memory.
Software Dependencies | Yes | Our algorithms were implemented in Python with PostgreSQL 12.10.
Experiment Setup | No | The paper describes the machine learning problems and baselines, and reports end-to-end runtime and optimization quality. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training settings for the models; it only states that "detailed experimental results are in our supplement," deferring training details (e.g., data splits, hyperparameters, and how they were chosen) to the supplement.
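As background for the coreset evaluation described above (an approximation guarantee parameterized by α, β, and z, with z = 2 for k-means), the following is a minimal, generic sketch of a uniform-sampling coreset for the k-means cost. It is an illustrative simplification only, not the paper's Algorithm 1 (AGGREGATION TREE) and not its sensitivity-based sampling; all function names are hypothetical.

```python
import numpy as np

def uniform_coreset(points, m, rng):
    # Uniformly sample m points without replacement and weight each
    # by n/m, so the expected weighted cost equals the full-data cost.
    n = len(points)
    idx = rng.choice(n, size=m, replace=False)
    weights = np.full(m, n / m)
    return points[idx], weights

def kmeans_cost(points, centers, weights=None):
    # k-means cost (z = 2): (weighted) sum over points of the
    # squared distance to the nearest center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    return float(d2 @ weights) if weights is not None else float(d2.sum())

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2))      # stand-in for the joined relation
centers = rng.normal(size=(5, 2))        # an arbitrary candidate solution

core, w = uniform_coreset(data, m=500, rng=rng)
full_cost = kmeans_cost(data, centers)
core_cost = kmeans_cost(core, centers, w)
# core_cost should approximate full_cost for this candidate solution.
```

A true (α, β)-coreset must approximate the cost for every candidate set of centers simultaneously, which is what the paper's construction achieves over a relational join without materializing it; the sketch above only illustrates the single-solution estimate that such samples provide.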