Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

Authors: Ximing Li, Chendi Wang, Guang Cheng

ICLR 2023

Reproducibility assessment. Each entry gives the variable, the assessed result, and the LLM response supporting it.
Research Type: Theoretical
"In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the L2 distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every ϵ-DP synthetic data generator."
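As a concrete illustration of the mechanism family the paper analyzes, the sketch below implements a minimal one-dimensional marginal-based ϵ-DP synthesizer: it perturbs a histogram with Laplace noise calibrated to the count vector's sensitivity, samples synthetic records from the renormalized marginal, and reports the TV distance that the paper uses as its error metric. This is not the paper's BN-based algorithm; the function names, ϵ = 1.0, and the Dirichlet toy data are illustrative assumptions, and the sensitivity-2 Laplace calibration is the standard textbook mechanism under record replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_marginal_synthesis(data, n_bins, epsilon):
    """Minimal marginal-based epsilon-DP synthesizer for one discrete
    attribute: add Laplace noise to bin counts, clip and renormalize,
    then sample synthetic records from the noisy marginal."""
    counts = np.bincount(data, minlength=n_bins).astype(float)
    # Replacing one record moves at most two counts by 1 each,
    # so the L1 sensitivity of the count vector is 2.
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=n_bins)
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() > 0:
        probs = noisy / noisy.sum()
    else:
        probs = np.full(n_bins, 1.0 / n_bins)
    # Sampling from the released marginal is post-processing,
    # so the synthetic data inherits the epsilon-DP guarantee.
    synthetic = rng.choice(n_bins, size=len(data), p=probs)
    return synthetic, probs

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Toy run at n = 40,000 and 40 categories, echoing the n and d scale
# quoted for ACS/Adult in the paper.
n, k = 40_000, 40
true_probs = rng.dirichlet(np.ones(k))
data = rng.choice(k, size=n, p=true_probs)

synthetic, released = dp_marginal_synthesis(data, n_bins=k, epsilon=1.0)
empirical = np.bincount(data, minlength=k) / n
print("TV(released, empirical) =", tv_distance(released, empirical))
```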
Researcher Affiliation: Academia
Ximing Li, Tsinghua University, Beijing, 100084, P. R. China (li-xm19@mails.tsinghua.edu.cn)
Chendi Wang, Shenzhen Research Institute of Big Data & University of Pennsylvania, Philadelphia, PA 19104, USA (chendi@wharton.upenn.edu)
Guang Cheng, Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA (guangcheng@ucla.edu)
Pseudocode: No
The paper describes its algorithms conceptually but does not include any structured pseudocode blocks or sections labeled "Algorithm".
Open Source Code: No
The paper does not state that code is released, nor does it link to a code repository for the methodology described.
Open Datasets: Yes
"For instance, in two real datasets ACS (Ruggles et al., 2015) and Adult (Bache & Lichman, 2013), the size n ≈ 40,000, the dimension d ≈ 40."
Dataset Splits: No
The paper is theoretical and does not describe an experimental setup with specific dataset splits (e.g., training/validation/test percentages or counts) for its own work. It mentions "Train on Synthetic data and Test on Real data" (TSTR; Esteban et al., 2017) as an existing utility metric, but this is not a description of its own data partitioning.
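Since the TSTR utility metric is referenced here, a minimal sketch of the protocol may help: train a model on synthetic data and score it on real data. The helper name tstr_score and the toy stand-in "synthetic" data are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tstr_score(X_syn, y_syn, X_real, y_real):
    """Train on Synthetic, Test on Real (TSTR): fit on the synthetic
    sample, evaluate on real data, so the score reflects how well the
    synthetic data preserves the real predictive structure."""
    model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return accuracy_score(y_real, model.predict(X_real))

# Toy illustration: in practice X_syn/y_syn would come from a DP
# synthesizer and X_real/y_real from the original dataset.
rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 5))
y_real = (X_real[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
X_syn = X_real + rng.normal(scale=0.3, size=X_real.shape)  # stand-in synthetic features
y_syn = y_real

print("TSTR accuracy:", tstr_score(X_syn, y_syn, X_real, y_real))
```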
Hardware Specification: No
The paper is theoretical and does not mention any specific hardware used for experiments (e.g., CPU or GPU models, or cloud computing instances).

Software Dependencies: No
The paper is theoretical and does not list any specific software dependencies or version numbers for libraries or tools.

Experiment Setup: No
The paper is theoretical and focuses on mathematical derivations and proofs. It does not provide details on experimental setup such as hyperparameters, optimization settings, or training configurations.