Data Augmentation MCMC for Bayesian Inference from Privatized Data

Authors: Nianqiao Ju, Jordan Awan, Ruobin Gong, Vinayak Rao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the efficacy and applicability of our methods on a naïve-Bayes log-linear model and on a linear regression model.
Researcher Affiliation | Academia | Nianqiao Phyllis Ju, Department of Statistics, Purdue University, West Lafayette, IN 47907, nianqiao@purdue.edu; Jordan A. Awan, Department of Statistics, Purdue University, West Lafayette, IN 47907, jawan@purdue.edu; Ruobin Gong, Department of Statistics, Rutgers University, Piscataway, NJ 08854, ruobin.gong@rutgers.edu; Vinayak A. Rao, Department of Statistics, Purdue University, West Lafayette, IN 47907, varao@purdue.edu
Pseudocode | Yes | Algorithm 1: One iteration of the privacy-aware Metropolis-within-Gibbs sampler (a hedged sketch of such an iteration follows the table).
Open Source Code | No | We will release our code to a public GitHub repository prior to the conference.
Open Datasets | No | The paper describes its simulation setup, stating 'We generate one non-private dataset from the model, and hold it fixed' and 'We generate one confidential dataset (x, y) and hold it fixed.' However, it does not provide concrete access information (link, DOI, or formal citation) for these simulated datasets or for any other public dataset used in the experiments.
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits, percentages, or sample counts. It mentions running chains for a set number of iterations and discarding burn-in, which is standard MCMC practice rather than a traditional data-splitting methodology for model validation.
Hardware Specification | No | We use an internal cluster for our experiments. A single run with 10,000 iterations takes 2–3 hours for the log-linear model and 3–4 hours for the linear regression model.
Software Dependencies | No | The paper states 'We implement our algorithm in Python (version 3.9) using PyTorch (version 1.10.1)' only in the supplementary material, which is external to the main paper. The main paper itself mentions no specific software dependencies with version numbers.
Experiment Setup | Yes | For the simulation, we set N = 100 (number of records), I = 5 (number of classes), K = 5 (number of features), and Jk = 3 for all k = 1, ..., K (possible values for each feature). We evaluate our sampler for privacy levels corresponding to ε ∈ {.1, .3, 1, 3, 10}. We discard the first 5000 iterations as burn-in. (A configuration sketch of this setup follows below.)
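
The Pseudocode row above points to Algorithm 1, one iteration of the privacy-aware Metropolis-within-Gibbs sampler. Below is a minimal Python sketch of a generic data-augmentation iteration of this kind, under the assumption that each latent confidential record is proposed from the model given the current parameters, so that only the privacy mechanism's likelihood η(s | x) enters the acceptance ratio. The callables privacy_loglik, propose_record, and update_theta (and the name mwg_iteration) are hypothetical placeholders, not the authors' released code; this is an illustration rather than a faithful transcription of Algorithm 1.

```python
import numpy as np

def mwg_iteration(theta, x, s_dp, privacy_loglik, propose_record, update_theta, rng):
    """One data-augmentation Metropolis-within-Gibbs iteration (sketch).

    theta          : current model parameters
    x              : current imputed confidential records (list of records)
    s_dp           : observed privatized statistic
    privacy_loglik : callable, log eta(s_dp | x) for a candidate dataset (hypothetical)
    propose_record : callable, draws a candidate record from the model p(x_i | theta) (hypothetical)
    update_theta   : callable, samples theta | x, e.g. a conjugate update (hypothetical)
    rng            : numpy.random.Generator
    """
    log_eta_curr = privacy_loglik(s_dp, x)
    for i in range(len(x)):
        x_prop = list(x)
        x_prop[i] = propose_record(theta, rng)      # model-based proposal for record i
        log_eta_prop = privacy_loglik(s_dp, x_prop)
        # With the model as proposal, prior and proposal terms cancel,
        # leaving only the privacy mechanism in the acceptance ratio.
        if np.log(rng.uniform()) < log_eta_prop - log_eta_curr:
            x, log_eta_curr = x_prop, log_eta_prop
    theta = update_theta(x, rng)                    # standard Gibbs step for the parameters
    return theta, x
```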
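
The Experiment Setup row lists the simulation settings for the naïve-Bayes log-linear example. The snippet below sketches how one such synthetic dataset could be generated and held fixed, using the reported sizes (N = 100, I = 5, K = 5, Jk = 3), privacy grid, run length, and burn-in; the Dirichlet parameterization, the seed, and all variable names are assumptions made for illustration, not the paper's actual data-generating code.

```python
import numpy as np

# Settings reported in the Experiment Setup and Hardware rows above.
N, I, K = 100, 5, 5            # records, classes, features
J = [3] * K                    # possible values for each feature
EPSILONS = [0.1, 0.3, 1, 3, 10]
N_ITER, BURN_IN = 10_000, 5_000

def simulate_naive_bayes_data(rng):
    """Draw one synthetic dataset from a naive-Bayes model
    (hypothetical parameterization; the paper holds a single such dataset fixed)."""
    class_probs = rng.dirichlet(np.ones(I))                            # P(class)
    feature_probs = [rng.dirichlet(np.ones(J[k]), size=I) for k in range(K)]  # P(feature k | class)
    y = rng.choice(I, size=N, p=class_probs)
    x = np.stack([
        np.array([rng.choice(J[k], p=feature_probs[k][yi]) for yi in y])
        for k in range(K)
    ], axis=1)                                                          # shape (N, K)
    return x, y

rng = np.random.default_rng(0)
x, y = simulate_naive_bayes_data(rng)   # generated once and held fixed across privacy levels
```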