Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Authors: Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms. |
| Researcher Affiliation | Industry | Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava MIT-IBM Watson AI Lab email: EMAIL |
| Pseudocode | Yes | Algorithm 1 Dual variables computation. |
| Open Source Code | No | The paper discusses existing DP mechanisms' implementations from the OpenDP library [Sma23] and mentions 'details on algorithm implementation' in supplementary material, but does not provide a direct link or explicit statement about open-sourcing the authors' own code for the proposed methodology. |
| Open Datasets | Yes | We evaluate our algorithm on four benchmark datasets: Adult, Bank, Mushroom, and Shopping, which are from the UCI machine learning repository [DG17]. |
| Dataset Splits | No | Specifically, we split the real data, using 80% for generating synthetic data and setting aside 20% to evaluate the performance of predictive models. This specifies a training and testing split but does not mention a separate validation split. |
| Hardware Specification | Yes | Additionally, our procedure, which includes computing the utility measures from real data, denoising the noisy answers, and computing optimal resampling weights, only takes around 4 mins on 1x NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using 'SDMetrics library [Dat23]' and 'synthcity library [QCvdS23]' for evaluation, and 'OpenDP library [Sma23]' for generating synthetic data, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We apply the Gaussian mechanism with (ϵ_post = 1, δ_post = 1/n²) to estimate utility measures from the real data, where n denotes the number of real data points. Finally, we apply Algorithm 1 with γ = 1e-5, a batch size of 256 for UCI datasets and 4096 for home-credit, and 200 epochs to compute the optimal resampling weights, which are then used to resample from the synthetic data with the same sample size. |
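
The Experiment Setup row names two mechanical steps: adding Gaussian noise to utility measures under (ϵ_post = 1, δ_post = 1/n²), and resampling the synthetic data with computed weights at the same sample size. The following is a minimal sketch of those two steps only, not the authors' implementation. It assumes the classical Gaussian-mechanism calibration σ = Δ·sqrt(2 ln(1.25/δ))/ϵ (the paper may calibrate noise differently), a hypothetical L2 sensitivity of sqrt(K)/n for K bounded averaged measures, and placeholder weights standing in for the output of Algorithm 1. All function names and example values are illustrative.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Noise scale under the classical Gaussian-mechanism bound (an assumption here)."""
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(values, epsilon, delta, l2_sensitivity, rng):
    """Add independent Gaussian noise to each utility measure."""
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity)
    return [v + rng.gauss(0.0, sigma) for v in values]


def resample(rows, weights, rng):
    """Draw len(rows) rows with replacement, with probability proportional to weights."""
    return rng.choices(rows, weights=weights, k=len(rows))


rng = random.Random(0)

n = 10_000                                    # hypothetical number of real data points
measures = [0.25, 0.50, 0.75]                 # hypothetical utility measures in [0, 1]
sensitivity = math.sqrt(len(measures)) / n    # assumed L2 sensitivity, not from the paper
noisy = privatize(measures, epsilon=1.0, delta=1.0 / n**2,
                  l2_sensitivity=sensitivity, rng=rng)

synthetic = ["row_a", "row_b", "row_c", "row_d"]   # stand-in synthetic records
weights = [0.4, 0.3, 0.2, 0.1]                     # placeholder for Algorithm 1 output
resampled = resample(synthetic, weights, rng)
```

The resampled set keeps the original sample size, matching the setup description; computing the weights themselves (Algorithm 1, with γ, batch size, and epochs) is deliberately left out of this sketch.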