Data Augmentation MCMC for Bayesian Inference from Privatized Data
Authors: Nianqiao Ju, Jordan Awan, Ruobin Gong, Vinayak Rao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the efficacy and applicability of our methods on a naïve-Bayes log-linear model and on a linear regression model. |
| Researcher Affiliation | Academia | Nianqiao Phyllis Ju, Department of Statistics, Purdue University, West Lafayette, IN 47907, nianqiao@purdue.edu; Jordan A. Awan, Department of Statistics, Purdue University, West Lafayette, IN 47907, jawan@purdue.edu; Ruobin Gong, Department of Statistics, Rutgers University, Piscataway, NJ 08854, ruobin.gong@rutgers.edu; Vinayak A. Rao, Department of Statistics, Purdue University, West Lafayette, IN 47907, varao@purdue.edu |
| Pseudocode | Yes | Algorithm 1 One iteration of the privacy-aware Metropolis-within-Gibbs sampler |
| Open Source Code | No | We will release our code to a public GitHub repository prior to the conference. |
| Open Datasets | No | The paper describes its simulation setup, stating 'We generate one non-private dataset from the model, and hold it fixed' and 'We generate one confidential dataset (x, y) and hold it fixed.' However, it does not provide concrete access information (link, DOI, formal citation) for these or any other public datasets used in the experiments. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits, percentages, or sample counts. It mentions running chains for a certain number of iterations and discarding burn-in, which is typical for MCMC, but not a traditional data splitting methodology for model validation. |
| Hardware Specification | No | We use an internal cluster for our experiments. A single run with 10,000 iterations takes 2–3 hours for the log-linear model and 3–4 hours for the linear regression model. |
| Software Dependencies | No | The paper states 'We implement our algorithm in Python (version 3.9) using PyTorch (version 1.10.1)' in the supplementary material, which is external to the main paper. Within the main paper itself, there are no specific software dependencies mentioned with version numbers. |
| Experiment Setup | Yes | For the simulation, we set N = 100 (number of records), I = 5 (number of classes), K = 5 (number of features), and J_k = 3 for all k = 1, ..., K (possible values for each feature). We evaluate our sampler for privacy levels corresponding to ε ∈ {0.1, 0.3, 1, 3, 10}. We discard the first 5000 iterations as burn-in. |
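The privacy-aware Metropolis-within-Gibbs sampler referenced above (Algorithm 1) alternates between imputing the confidential records given the privatized release and updating the model parameters given the imputed data. The sketch below illustrates this data-augmentation structure on a hypothetical toy problem, not one of the paper's models: Bernoulli records with a Laplace-noised sum as the privatized statistic. All names (`log_priv_lik`, `gibbs_iteration`) and the conjugate Beta update are illustrative assumptions; because each record is proposed from the model, the acceptance ratio reduces to a ratio of privacy-mechanism likelihoods, which is the key property of the privacy-aware sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (not the paper's log-linear or regression models):
# x_i ~ Bernoulli(theta); the released statistic is s = sum(x) + Laplace noise.
N, eps = 100, 1.0
true_theta = 0.3
x_true = rng.binomial(1, true_theta, size=N)
s_obs = x_true.sum() + rng.laplace(scale=1.0 / eps)

def log_priv_lik(s, x):
    # Log-density of the Laplace privacy mechanism given confidential data x.
    return -eps * abs(s - x.sum())

def gibbs_iteration(theta, x):
    # Step 1: refresh latent records one at a time (Metropolis-within-Gibbs).
    for i in range(N):
        x_prop = x.copy()
        x_prop[i] = rng.binomial(1, theta)  # propose from the model given theta
        # Model terms cancel with the proposal, leaving only the
        # privacy-mechanism likelihood ratio in the acceptance probability.
        log_alpha = log_priv_lik(s_obs, x_prop) - log_priv_lik(s_obs, x)
        if rng.uniform() < np.exp(min(0.0, log_alpha)):
            x = x_prop
    # Step 2: conjugate update of theta given the imputed data (Beta(1,1) prior).
    theta = rng.beta(1 + x.sum(), 1 + N - x.sum())
    return theta, x

theta, x = 0.5, rng.binomial(1, 0.5, size=N)
for _ in range(200):
    theta, x = gibbs_iteration(theta, x)
```

After burn-in, draws of `theta` approximate the posterior given only the privatized statistic `s_obs`; the imputed `x` is a by-product of the augmentation and is never conditioned on directly.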