Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms
Authors: Ximing Li, Chendi Wang, Guang Cheng
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the L2 distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every ϵ-DP synthetic data generator. |
| Researcher Affiliation | Academia | Ximing Li, Tsinghua University, Beijing, 100084, P. R. China (li-xm19@mails.tsinghua.edu.cn); Chendi Wang, Shenzhen Research Institute of Big Data & University of Pennsylvania, Philadelphia, PA 19104, USA (chendi@wharton.upenn.edu); Guang Cheng, Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA (guangcheng@ucla.edu) |
| Pseudocode | No | The paper describes algorithms conceptually but does not include any structured pseudocode blocks or sections labeled 'Algorithm'. |
| Open Source Code | No | The paper does not include any statements about releasing open-source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | For instance, in two real datasets ACS (Ruggles et al., 2015) and Adult (Bache & Lichman, 2013), the size n ≈ 40,000 and the dimension d ≈ 40. |
| Dataset Splits | No | The paper is theoretical and does not describe an experimental setup with specific dataset splits (e.g., training, validation, test percentages or counts) for its own work. It mentions 'Train on Synthetic data and Test on Real data (TSTR, (Esteban et al., 2017))' as an existing utility metric, but this is not a description of its own data partitioning. |
| Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments (e.g., CPU, GPU models, or cloud computing instances). |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies or version numbers for libraries or tools. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical derivations and proofs. It does not provide details on experimental setup such as hyperparameters, optimization settings, or training configurations. |
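As context for the accuracy guarantees quoted in the Research Type row, where errors are measured by the total variation (TV) distance, the following is a minimal sketch of computing the TV distance between two discrete distributions. The function name and example data are illustrative and not taken from the paper.

```python
# Minimal sketch: total variation (TV) distance between two discrete
# distributions over a shared finite support -- the error metric cited
# in the paper's accuracy guarantees. Names and data are illustrative.

def tv_distance(p, q):
    """TV(p, q) = (1/2) * sum_x |p(x) - q(x)|, with p and q as dicts
    mapping outcomes to probabilities; missing outcomes count as 0."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Example: a true marginal vs. a hypothetical DP-synthetic marginal.
true_marginal = {"a": 0.5, "b": 0.3, "c": 0.2}
synth_marginal = {"a": 0.45, "b": 0.35, "c": 0.2}
print(tv_distance(true_marginal, synth_marginal))  # approximately 0.05
```

A smaller TV distance means the synthetic distribution is closer to the true one; the paper's lower bound applies to every ϵ-DP synthetic data generator under this metric.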