Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Authors: Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms." |
| Researcher Affiliation | Industry | Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava, MIT-IBM Watson AI Lab; email: {hao, shiv.sr, john.l.henning, kristjan.h.greenewald, akash.srivastava}@ibm.com |
| Pseudocode | Yes | "Algorithm 1: Dual variables computation." |
| Open Source Code | No | The paper discusses existing DP mechanisms' implementations from the OpenDP library [Sma23] and mentions "details on algorithm implementation" in the supplementary material, but it does not provide a direct link to, or an explicit statement about open-sourcing, the authors' own code for the proposed method. |
| Open Datasets | Yes | "We evaluate our algorithm on four benchmark datasets: Adult, Bank, Mushroom, and Shopping, which are from the UCI machine learning repository [DG17]." |
| Dataset Splits | No | "Specifically, we split the real data, using 80% for generating synthetic data and setting aside 20% to evaluate the performance of predictive models." This specifies a training and testing split but does not mention a separate validation split. |
| Hardware Specification | Yes | "Additionally, our procedure, which includes computing the utility measures from real data, denoising the noisy answers, and computing optimal resampling weights, only takes around 4 mins on 1x NVIDIA GeForce RTX 3090 GPU." |
| Software Dependencies | No | The paper mentions using the SDMetrics library [Dat23] and the synthcity library [QCvdS23] for evaluation, and the OpenDP library [Sma23] for generating synthetic data, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | "We apply the Gaussian mechanism with (ε_post = 1, δ_post = 1/n²) to estimate utility measures from the real data, where n denotes the number of real data points. Finally, we apply Algorithm 1 with γ = 1e-5, a batch size of 256 for UCI datasets and 4096 for home-credit, and 200 epochs to compute the optimal resampling weights, which are then used to resample from the synthetic data with the same sample size." A minimal sketch of this setup follows the table. |
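The Experiment Setup row describes two concrete steps: privately estimating utility measures with the Gaussian mechanism, and resampling the synthetic data according to weights produced by Algorithm 1. The sketch below is a minimal illustration of those two steps only, assuming the utility measures are mean-style queries with per-record sensitivity 1/n; the function names (`gaussian_mechanism`, `resample_synthetic`) and the uniform placeholder weights are hypothetical and are not taken from the authors' code, which computes the weights via the dual-variable procedure of Algorithm 1.

```python
import numpy as np

def gaussian_mechanism(values, sensitivity, epsilon, delta, rng):
    """Add calibrated Gaussian noise to a vector of utility measures.

    Uses the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon,
    which satisfies (epsilon, delta)-DP for epsilon <= 1.
    """
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return values + rng.normal(scale=sigma, size=values.shape)

def resample_synthetic(synthetic, weights, rng):
    """Resample the synthetic dataset with replacement according to the
    resampling weights, keeping the original sample size."""
    weights = np.clip(weights, 0.0, None)
    probs = weights / weights.sum()
    idx = rng.choice(len(synthetic), size=len(synthetic), replace=True, p=probs)
    return synthetic[idx]

# Illustrative usage with toy numbers (n real records, k utility measures).
rng = np.random.default_rng(0)
n, k = 10_000, 8
true_measures = rng.uniform(size=k)           # utility measures computed on the real data
noisy_measures = gaussian_mechanism(
    true_measures,
    sensitivity=1.0 / n,                      # a mean query changes by at most 1/n per record
    epsilon=1.0,                              # eps_post = 1, as reported in the paper
    delta=1.0 / n**2,                         # delta_post = 1/n^2, as reported in the paper
    rng=rng)

# In the paper, the weights come from Algorithm 1 (dual variables computation);
# uniform placeholders are used here just to show the resampling call.
synthetic = rng.normal(size=(n, 5))
weights = np.ones(n)
resampled = resample_synthetic(synthetic, weights, rng)
```

The noise calibration shown is the standard analytic Gaussian-mechanism bound for ε ≤ 1; the paper itself only states the privacy parameters, not the specific calibration formula used.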