reproducibilityindex.ai

Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Authors: Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
Researcher Affiliation	Industry	Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava MIT-IBM Watson AI Lab email: {hao, shiv.sr, john.l.henning, kristjan.h.greenewald, akash.srivastava}@ibm.com
Pseudocode	Yes	Algorithm 1 Dual variables computation.
Open Source Code	No	The paper discusses existing DP mechanisms' implementations from the OpenDP library [Sma23] and mentions 'details on algorithm implementation' in supplementary material, but does not provide a direct link or explicit statement about open-sourcing the authors' own code for the proposed methodology.
Open Datasets	Yes	We evaluate our algorithm on four benchmark datasets: Adult, Bank, Mushroom, and Shopping, which are from the UCI machine learning repository [DG17].
Dataset Splits	No	Specifically, we split the real data, using 80% for generating synthetic data and setting aside 20% to evaluate the performance of predictive models. This specifies a training and testing split but does not mention a separate validation split.
Hardware Specification	Yes	Additionally, our procedure, which includes computing the utility measures from real data, denoising the noisy answers, and computing optimal resampling weights, only takes around 4 mins on 1x NVIDIA GeForce RTX 3090 GPU.
Software Dependencies	No	The paper mentions using 'SDMetrics library [Dat23]' and 'synthcity library [QCvdS23]' for evaluation, and 'OpenDP library [Sma23]' for generating synthetic data, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup	Yes	We apply the Gaussian mechanism with (ϵ_post = 1, δ_post = 1/n^2) to estimate utility measures from the real data, where n denotes the number of real data points. Finally, we apply Algorithm 1 with γ = 1e-5, a batch size of 256 for UCI datasets and 4096 for home-credit, and 200 epochs to compute the optimal resampling weights, which are then used to resample from the synthetic data with the same sample size.
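
The Experiment Setup row describes two mechanical steps that a short sketch may make more concrete: setting the privacy parameters as a function of n, and resampling the synthetic data with precomputed weights while keeping the sample size unchanged. The Python below is only an illustration under stated assumptions, not the authors' implementation. The noise scale uses the classical Gaussian-mechanism formula, which is usually stated for ϵ < 1 and may differ from the calibration the paper actually uses. The sensitivity of 1/n, the toy data, the random placeholder weights (standing in for the output of Algorithm 1), and the names gaussian_sigma and resample are all hypothetical.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumed calibration, not from the paper)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def resample(synthetic, weights, rng):
    """Draw len(synthetic) rows with replacement, with probability proportional to weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(synthetic), size=len(synthetic), replace=True, p=probs)
    return synthetic[idx]

rng = np.random.default_rng(0)

n = 1000                      # hypothetical number of real data points
eps_post = 1.0                # epsilon_post = 1, as quoted above
delta_post = 1.0 / n**2       # delta_post = 1/n^2, as quoted above
# Assumed sensitivity 1/n, e.g. for the mean of a statistic bounded in [0, 1].
sigma = gaussian_sigma(sensitivity=1.0 / n, epsilon=eps_post, delta=delta_post)

synthetic = rng.integers(0, 2, size=(500, 3))   # toy synthetic table
weights = rng.random(500)                       # placeholder for Algorithm 1's weights
resampled = resample(synthetic, weights, rng)   # same sample size as the input
```

Sampling with replacement keeps the output the same size as the synthetic input, matching the "same sample size" wording in the quoted setup; rows with larger weights simply appear more often.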