Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Authors: Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms. |
| Researcher Affiliation | Industry | Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava MIT-IBM Watson AI Lab email: {hao, shiv.sr, john.l.henning, kristjan.h.greenewald, akash.srivastava}@ibm.com |
| Pseudocode | Yes | Algorithm 1 Dual variables computation. |
| Open Source Code | No | The paper discusses existing DP mechanisms' implementations from the OpenDP library [Sma23] and mentions 'details on algorithm implementation' in supplementary material, but does not provide a direct link or explicit statement about open-sourcing the authors' own code for the proposed methodology. |
| Open Datasets | Yes | We evaluate our algorithm on four benchmark datasets: Adult, Bank, Mushroom, and Shopping, which are from the UCI machine learning repository [DG17]. |
| Dataset Splits | No | Specifically, we split the real data, using 80% for generating synthetic data and setting aside 20% to evaluate the performance of predictive models. This specifies a training and testing split but does not mention a separate validation split. |
| Hardware Specification | Yes | Additionally, our procedure, which includes computing the utility measures from real data, denoising the noisy answers, and computing optimal resampling weights, only takes around 4 mins on 1x NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using 'SDMetrics library [Dat23]' and 'synthcity library [QCvdS23]' for evaluation, and 'OpenDP library [Sma23]' for generating synthetic data, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We apply the Gaussian mechanism with $(\epsilon_{\text{post}} = 1, \delta_{\text{post}} = 1/n^2)$ to estimate utility measures from the real data, where $n$ denotes the number of real data points. Finally, we apply Algorithm 1 with $\gamma$ = 1e-5, a batch size of 256 for UCI datasets and 4096 for home-credit, and 200 epochs to compute the optimal resampling weights, which are then used to resample from the synthetic data with the same sample size. |
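
The experiment setup in the last row can be illustrated with a minimal sketch. It is not the authors' code. It assumes the classical Gaussian-mechanism calibration (usually stated for ε < 1, so ε = 1 is borderline), a placeholder L2 sensitivity of 1/n, and made-up weights standing in for the output of Algorithm 1. The names `gaussian_sigma`, `privatize`, and `resample` are hypothetical helpers introduced only for this example.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    # Classical calibration: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
    # Assumption: the paper may use a different (e.g. tighter) calibration.
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(answers, epsilon, delta, l2_sensitivity, rng):
    # Add independent Gaussian noise to each utility-measure answer.
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity)
    return [a + rng.gauss(0.0, sigma) for a in answers]


def resample(rows, weights, rng):
    # Draw len(rows) rows with replacement, with probability proportional
    # to the weights, so the output has the same sample size as the input.
    return rng.choices(rows, weights=weights, k=len(rows))


rng = random.Random(0)

n = 1000                                 # number of real data points (illustrative)
eps_post, delta_post = 1.0, 1.0 / n**2   # privacy parameters quoted in the table
real_answers = [0.30, 0.55]              # made-up utility-measure values
noisy = privatize(real_answers, eps_post, delta_post,
                  l2_sensitivity=1.0 / n,  # placeholder sensitivity, an assumption
                  rng=rng)

synthetic_rows = ["a", "b", "c", "d"]    # stand-ins for synthetic records
weights = [0.5, 0.5, 0.0, 0.0]           # stand-ins for Algorithm 1's output
resampled = resample(synthetic_rows, weights, rng)
```

Rows with zero weight are never drawn, and the resampled set keeps the original sample size, matching the "same sample size" remark in the table. Computing the weights themselves (Algorithm 1 with its γ, batch size, and epoch settings) is outside the scope of this sketch.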