CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources
Authors: Sikha Pentyala, Mayana Pereira, Martine De Cock
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate these pipelines for performance, measuring runtime and communication cost. We assess the quality of the synthetic data by reporting the average workload error on benchmark datasets. We also assess the utility of the generated synthetic data for downstream machine learning (ML) tasks by reporting AUC and F1 scores obtained with logistic regression and random forest models trained on the benchmark data. Our evaluations demonstrate that CaPS generates differentially private synthetic data of the same level of quality as in the centralized paradigm, while in addition providing input privacy, making it suitable for collaborative SDG from distributed data holders. |
| Researcher Affiliation | Academia | 1School of Engineering and Technology, University of Washington Tacoma, USA 2Department of Electrical Engineering, Universidade de Brasilia, Brazil 3Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Belgium. |
| Pseudocode | Yes | Algorithm 1 CaPS: Generating tabular synthetic data with DP-in-MPC in the select-measure-generate template; Protocol 2 πCOMP: MPC Protocol for COMPUTE ANSWERS; Protocol 3 πSELECT: MPC Protocol for SELECT; Protocol 4 πRC: MPC Protocol for random selection; Protocol 5 πMEASURE: MPC Protocol for MEASURE using Gaussian noise; Protocol 6 πSELECT: MPC Protocol for SELECT for AIM; Protocol 7 πL1 NORM: MPC Protocol to compute L1-norm; Protocol 8 πSELECT: MPC Protocol for SELECT for MWEM+PGM; Protocol 9 πGSS: Box-Muller to generate Gaussian sample with mean 0 and variance 1; Protocol 10 πMEASURE: MPC Protocol for MEASURE using Laplacian noise; Protocol 11 MPC Protocol to compute p-way marginals |
| Open Source Code | Yes | We have made the code available at https://github.com/sikhapentyala/MPC_SDG/tree/icml |
| Open Datasets | Yes | Datasets. We evaluate CaPS on three datasets: breast-cancer (Zwitter & Soklic, 1988), prison recidivism (COMPAS) (Angwin et al., 2016), and diabetes (Smith et al., 1988). The breast-cancer dataset has 10 categorical attributes and 285 samples. The COMPAS data consists of categorical data. We utilize the same version as in (Calmon et al., 2017), which consists of 7 categorical features and 7,214 samples. The diabetes dataset has 9 continuous attributes and 768 samples. We use the train sets to generate synthetic data and the test sets to evaluate the quality of the synthetic data. |
| Dataset Splits | Yes | We randomly split all the datasets into train and test in an 80% to 20% ratio. |
| Hardware Specification | Yes | To measure the average time to generate synthetic data in CDP and CaPS, we run experiments in a simulated environment on Azure D8ads v5 (8 vCPUs, 32 GiB RAM). ... In Table 3, we evaluate individual MPC subprotocols for different threat models when they are run on independent instances of Azure Standard F16s v2 (16 vCPUs, 32 GiB memory) and network bandwidth of 12.5 Gbps. |
| Software Dependencies | Yes | We implemented the MPC protocols of CaPS in MP-SPDZ (Keller, 2020b). |
| Experiment Setup | Yes | Synthetic data was generated with ϵ = 1.0. ... We train logistic regression and random forest models on the generated synthetic datasets and report the AUC-ROC and F1 score. ... S1 begins with randomly initializing D̂ as defined by INIT() on Line 2. ... For the distributed scenario with CaPS, N = 2 and M = 3 (3PC passive (Araki et al., 2016)). ... We run experiments in a simulated environment on Azure D8ads v5 (8 vCPUs, 32 GiB RAM). |
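The MEASURE step of the select-measure-generate template (Protocol 5 in the pseudocode list above) adds Gaussian noise to the answers of the selected marginal queries. A minimal plaintext sketch of that step, for illustration only: in CaPS the same computation runs inside MPC so no party ever sees the true counts, and the function name `measure_marginal` and the assumption that `sigma` has already been calibrated from the privacy budget ϵ are ours, not the paper's.

```python
import random

def measure_marginal(counts: list[float], sigma: float) -> list[float]:
    """Plaintext sketch of the MEASURE step: perturb each cell of a
    marginal's count vector with Gaussian noise of standard deviation
    sigma. In CaPS this happens under MPC (DP-in-MPC), and sigma is
    derived from the overall privacy budget (epsilon = 1.0 in the
    paper's experiments)."""
    return [c + random.gauss(0.0, sigma) for c in counts]
```

With `sigma = 0.0` the function returns the exact counts, which makes the privacy/utility trade-off easy to see: larger `sigma` gives stronger differential privacy but noisier measurements for the generate step to fit.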
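Protocol 9 (πGSS) generates standard Gaussian samples via the Box-Muller transform, which the MPC protocols need because they cannot simply call a library RNG on secret-shared values. A plaintext sketch of the Box-Muller transform itself (the MPC version operates on secret shares; this sketch is only the underlying math):

```python
import math
import random

def box_muller() -> tuple[float, float]:
    """Box-Muller transform: turn two independent uniform samples on
    [0, 1) into two independent standard normal samples (mean 0,
    variance 1), as in the paper's Protocol 9 (pi_GSS)."""
    u1 = random.random()
    # Guard against log(0); random.random() can return exactly 0.0.
    while u1 == 0.0:
        u1 = random.random()
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)
```

Box-Muller is a common choice inside MPC because it needs only square root, logarithm, and trigonometric evaluations, all of which have known secure approximations, rather than rejection sampling with data-dependent control flow.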