Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bayesian Pseudo Posterior Mechanism under Asymptotic Differential Privacy
Authors: Terrance D. Savitsky, Matthew R. Williams, Jingchen Hu
JMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our pseudo posterior mechanism on the sensitive family income variable from the Consumer Expenditure Surveys database published by the U.S. Bureau of Labor Statistics. We show that utility is better preserved in the synthetic data for our pseudo posterior mechanism as compared to the EM, both estimated using the same non-private synthesizer, due to our use of targeted downweighting. |
| Researcher Affiliation | Collaboration | Terrance D. Savitsky EMAIL Office of Survey Methods Research U.S. Bureau of Labor Statistics 2 Massachusetts Ave NE Washington, DC 20212, USA Matthew R. Williams EMAIL National Center for Science and Engineering Statistics National Science Foundation 2415 Eisenhower Ave Alexandria, VA 22314, USA Jingchen Hu EMAIL Vassar College 124 Raymond Ave, Box 27 Poughkeepsie, NY 12604, USA |
| Pseudocode | Yes | 1. Compute weights α. (a) Let \|f_{θ_s,i}\| denote the absolute value of the log-likelihood computed from the unweighted pseudo posterior synthesizer for database record i ∈ (1, …, n) and MCMC draw s ∈ (1, …, S) of θ. (b) Compute the S × n matrix of by-record (absolute value of) log-likelihoods, L = {\|f_{θ_s,i}\|}_{i=1,…,n; s=1,…,S}. (c) Compute the maximum over each S × 1 column of L to produce the n × 1 (database-record-indexed) vector, f̄ = (f̄_1, …, f̄_n). We use a linear transformation of each f̄_i to f̃_i ∈ [0, 1], where values of f̃_i closer to 1 indicate relatively higher identification disclosure risk: f̃_i = (f̄_i − min_j f̄_j) / (max_j f̄_j − min_j f̄_j). (d) We formulate by-record weights α = (α_1, …, α_n), α_i = c × (1 − f̃_i) + g (Equation 9), where c and g denote scaling and shift parameters, respectively, of the α_i used to tune the risk–utility trade-off. If we set the scaling tuning parameter c = 1 and the shift tuning parameter g = 0, then each α_i is simply (1 − f̃_i), such that the pseudo likelihood weights are solely a function of the record-indexed log-likelihoods. As discussed in Hu et al. (2021), decreasing c < 1 will compress the distribution of the (α_i), while setting g < 0 will shift the distribution of the weights downward such that more weights will be close to 0. We use truncation to ensure each α_i ∈ [0, 1]. These α satisfy a slightly weaker asymptotic form of Assumptions 1 and 2. We will show in Section 5 the effects of different configurations of c and g on the risk and utility profiles of the differentially private synthetic data set for the CE sample, generated under our proposed α-weighted pseudo posterior mechanism. 2. Compute the Lipschitz bound Δ_{α,x}. (a) Use α = (α_1, …, α_n) to construct the pseudo likelihood of Equation 4, from which the pseudo posterior of Equation 5 is estimated. Draw (θ_s)_{s=1,…,S} from the α-weighted pseudo posterior distribution. (b) As earlier, compute the S × n matrix of log-pseudo-likelihood values, L_α = {\|f^α_{θ_s,i}\|}_{i=1,…,n; s=1,…,S}. (c) Compute Δ_{α,x} = max_{s,i} \|f^α_{θ_s,i}\|. 3. Draw synthetic data, ζ_ℓ, from the pseudo posterior distribution. (a) Using the (θ_s)_{s=1,…,S} drawn from the α-weighted pseudo posterior distribution estimated in the earlier step, randomly sample ℓ = 1, …, (m = 20) parameter values and draw synthetic data values ζ_{ℓ,i} ~ind p_{θ_ℓ}(·) for parameter draw ℓ ∈ (1, …, m) and database record i ∈ (1, …, n). This step accomplishes a draw from the pseudo posterior predictive distribution. (b) Release the synthetic data, ζ = (ζ_1, …, ζ_m), in place of the closely held real data, x. |
| Open Source Code | No | No explicit statement about code release or link to a code repository is provided in the paper. |
| Open Datasets | Yes | Our application of the α-weighted pseudo posterior mechanism focuses on providing privacy protection for a family income variable published by the CE. The CE is administered by the BLS with the purpose of providing income and expenditure patterns indexed by geographic domains to support policy-making by State and Federal governments. ...The CE public-use microdata (PUMD)1 is publicly available record-level data, published by the CE. 1. For more information about CE PUMD, visit https://www.bls.gov/cex/pumd.htm. |
| Dataset Splits | No | The paper describes generating synthetic datasets from models trained on real data, and a Monte Carlo simulation study involving data generation. It does not provide standard training/validation/test splits of an existing dataset for model evaluation or reproduction purposes. For example, Section 5.2 states, "a set of m = 20 synthetic databases were generated and the distribution for each statistic was estimated on each databases (under re-sampling)." This refers to generating synthetic data, not splitting a single dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments or simulations. |
| Software Dependencies | No | The paper mentions statistical models and synthesizers but does not provide specific software names with version numbers (e.g., Python, PyTorch, specific statistical packages with versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | 1. Compute weights α ... (d) We formulate by-record weights α = (α_1, …, α_n), α_i = c × (1 − f̃_i) + g (Equation 9), where c and g denote scaling and shift parameters, respectively, of the α_i used to tune the risk–utility trade-off. ...Table 2: Table of values of the Lipschitz bound Δ_{α,x} of the synthesizer under the α-weighted pseudo posterior mechanism, for a series of (c, g) configurations. ...Using the means model for Poisson-distributed data, y ~ Pois(µ) (with µ = 100), our simulation procedure is as follows. ...2. We add a step to truncate to 0 the weight of any record whose weighted log-pseudo-likelihood value is greater than some threshold, M... We choose M based on oracle information, drawing on experience with databases of similar types. ...The Weighted-M mechanism, under setting M = 3.5, demonstrates rapid contraction... ...We compute utilities over m = 20 synthetic databases to fully capture the uncertainty in the synthetic data generation process from the (pseudo) posterior predictive distributions. |
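The weight-computation steps quoted in the Pseudocode row (steps 1a–1d and 2a–2c) can be sketched compactly. This is an illustrative reconstruction, not the authors' code: the function names `alpha_weights` and `lipschitz_bound` and the array layout (an S × n matrix of log-likelihoods, S MCMC draws by n records) are assumptions made for the sketch.

```python
import numpy as np

def alpha_weights(log_lik, c=1.0, g=0.0):
    """Steps 1a-1d: by-record downweighting from an S x n matrix of
    log-likelihood values (S MCMC draws, n database records)."""
    L = np.abs(log_lik)                  # S x n matrix of |f_{theta_s, i}|
    f_bar = L.max(axis=0)                # per-record maximum over draws
    # Linear rescaling to [0, 1]; values near 1 flag higher disclosure risk.
    f_tilde = (f_bar - f_bar.min()) / (f_bar.max() - f_bar.min())
    alpha = c * (1.0 - f_tilde) + g      # Equation (9)
    return np.clip(alpha, 0.0, 1.0)      # truncate each weight to [0, 1]

def lipschitz_bound(weighted_log_lik):
    """Step 2c: Delta_{alpha,x} = max over draws and records of the
    absolute weighted log-pseudo-likelihood."""
    return np.abs(weighted_log_lik).max()
```

Note that the riskiest record (largest f̄_i) receives f̃_i = 1 and hence, at c = 1 and g = 0, a weight of exactly 0, which is what drives the targeted downweighting described in the paper.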
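The Experiment Setup row additionally describes a Weighted-M truncation step (zeroing the weight of any record whose weighted log-pseudo-likelihood exceeds a threshold M, e.g. M = 3.5) and the generation of m = 20 synthetic databases under the Poisson means model. A minimal sketch under assumed shapes, with hypothetical helper names:

```python
import numpy as np

def truncate_by_threshold(alpha, weighted_log_lik, M):
    """Weighted-M step (sketch): zero the weight of any record whose
    maximum absolute weighted log-pseudo-likelihood exceeds M."""
    peak = np.abs(weighted_log_lik).max(axis=0)  # per-record max over draws
    return np.where(peak > M, 0.0, alpha)

def draw_synthetic_databases(mu_draws, n, m=20, rng=None):
    """Step 3 (sketch) for the simulation's y ~ Pois(mu) means model:
    sample m parameter draws and generate one synthetic database each."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = rng.choice(mu_draws, size=m, replace=False)
    return [rng.poisson(mu, size=n) for mu in chosen]
```

Each element of the returned list plays the role of one synthetic database ζ_ℓ; utilities would then be computed over all m databases, as the quoted setup describes.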