Differentially private Bayesian learning on distributed data
Authors: Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma, Antti Honkela
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the secure DP Bayesian learning scheme in practice by testing the performance of the BLR with data projection, the implementation of which was discussed in Section 3.2.1, along with the DCA (Algorithm 1) in the all-HbC-clients distributed setting (T = 0). We use simulated data for the DCA scalability testing, and real data for the BLR tests. As real data, we use the Wine Quality [6] (split into white and red wines) and Abalone data sets from the UCI repository [18], as well as the Genomics of Drug Sensitivity in Cancer (GDSC) project data². |
| Researcher Affiliation | Academia | Mikko Heikkilä¹ (mikko.a.heikkila@helsinki.fi), Eemil Lagerspetz² (eemil.lagerspetz@helsinki.fi), Samuel Kaski³ (samuel.kaski@aalto.fi), Kana Shimizu⁴ (shimizu.kana.g@gmail.com), Sasu Tarkoma² (sasu.tarkoma@helsinki.fi), Antti Honkela¹,⁵ (antti.honkela@helsinki.fi). ¹Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki; ²Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki; ³Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University; ⁴Department of Computer Science and Engineering, Waseda University; ⁵Department of Public Health, University of Helsinki |
| Pseudocode | Yes | Algorithm 1 (Distributed Compute Algorithm for distributed summation with independent Compute nodes). Input: $d$-dimensional vectors $z_i$ held by clients $i \in \{1, \dots, N\}$; distributed Gaussian mechanism noise variances $\sigma_j^2$, $j = 1, \dots, d$ (public); number of parties $N$ (public); number of Compute nodes $M$ (public). Output: differentially private sum $\sum_{i=1}^N (z_i + \eta_i)$, where $\eta_i \sim \mathcal{N}(0, \mathrm{diag}(\sigma_j^2))$. 1: Each client $i$ simulates $\eta_i \sim \mathcal{N}(0, \mathrm{diag}(\sigma_j^2))$ and $M-1$ vectors $r_{i,k}$ of uniformly random fixed-point data, with $r_{i,M} = -\sum_{k=1}^{M-1} r_{i,k}$ to ensure that $\sum_{k=1}^{M} r_{i,k} = 0_d$ (a vector of zeros). 2: Each client $i$ computes the messages $m_{i,1} = z_i + \eta_i + r_{i,1}$ and $m_{i,k} = r_{i,k}$, $k = 2, \dots, M$, and sends them securely to the corresponding Compute node $k$. 3: After receiving messages from all of the clients, Compute node $k$ decrypts the values and broadcasts the noisy aggregate sum $q_k = \sum_{i=1}^N m_{i,k}$. A final aggregator then adds these to obtain $\sum_{k=1}^M q_k = \sum_{i=1}^N (z_i + \eta_i)$. Algorithm 2 (Distributed linear regression with projection). Input: data and target values $(x_{ij}, y_i)$, $j = 1, \dots, d$, held by clients $i \in \{1, \dots, N\}$; number of clients $N$ (public); assumed data and target bounds $(-c_j, c_j)$, $j = 1, \dots, d+1$ (public); privacy budget $(\epsilon, \delta)$ (public). Output: DP BLR model sufficient statistics of the projected data, $\sum_{i=1}^N \hat{x}_i \hat{x}_i^T + \eta^{(1)}$ and $\sum_{i=1}^N \hat{x}_i^T \hat{y}_i + \eta^{(2)}$, calculated using projection to estimated optimal bounds. 1: Each client projects his data to the assumed bounds $(-c_j, c_j)$ $\forall j$. 2: Calculate marginal std estimates $\sigma^{(1)}, \dots, \sigma^{(d+1)}$ by running Algorithm 1 using the assumed bounds for sensitivity and a chosen share of the privacy budget. 3: Estimate optimal projection thresholds $p_j$, $j = 1, \dots, d+1$, as fractions of std on auxiliary data; each client then projects his data to the estimated optimal bounds $(-p_j \sigma^{(j)}, p_j \sigma^{(j)})$, $j = 1, \dots, d+1$. 4: Aggregate the unique terms in the DP sufficient statistics by running Algorithm 1 using the estimated optimal bounds for sensitivity and the remaining privacy budget, and combine the DP result vectors into the symmetric $d \times d$ matrix and $d$-dimensional vector of DP sufficient statistics. (Minimal Python sketches of both algorithms are given after the table.) |
| Open Source Code | Yes | The source code for our implementation is available through GitHub¹ and a more detailed description can be found in the Supplement. ¹https://github.com/DPBayes/dca-nips2017 |
| Open Datasets | Yes | As real data, we use the Wine Quality [6] (split into white and red wines) and Abalone data sets from the UCI repository [18], as well as the Genomics of Drug Sensitivity in Cancer (GDSC) project data². ... ²http://www.cancerrxgene.org/, release 6.1, March 2017 |
| Dataset Splits | Yes | For UCI, we compare the median performance measured on mean absolute error over 25 cross-validation (CV) runs, while for GDSC we measure mean prediction accuracy (sensitive vs. insensitive) with Spearman's rank correlation on 25 CV runs. |
| Hardware Specification | No | The paper mentions 'a modern CPU' for general timing estimation ('running AES for the data of the largest example would take less than 20 s on a single thread on a modern CPU') but does not provide specific hardware details (e.g., CPU model, GPU, RAM) used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'distributed Spark implementation' but does not provide specific version numbers for Spark or any other software dependencies. |
| Experiment Setup | Yes | The optimal projection thresholds are searched for using 10 (GDSC) or 20 (UCI) repeats on a grid with 20 points between 0.1 and 2.1 times the std of the auxiliary data set. In the search we use one common threshold for all data dimensions and a separate one for the target. (A schematic sketch of this grid search is given after the table.) |
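
A minimal NumPy sketch of Algorithm 1 (the DCA) may help make the message flow concrete. This is a single-process simulation under stated assumptions: floating-point masks stand in for the paper's fixed-point encoding, the encryption layer is omitted, and the function names and mask range are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def client_messages(z, sigmas, M, rng):
    """Steps 1-2 of Algorithm 1 for one client: the noisy value plus
    M masks that sum to the zero vector."""
    d = len(z)
    eta = rng.normal(0.0, sigmas)                  # eta_i ~ N(0, diag(sigma_j^2))
    r = rng.uniform(-1e6, 1e6, size=(M - 1, d))    # random masks; the range is an assumption
    r = np.vstack([r, -r.sum(axis=0)])             # r_{i,M} = -sum of the rest, so the masks cancel
    r[0] += z + eta                                # m_{i,1} = z_i + eta_i + r_{i,1}; m_{i,k} = r_{i,k} otherwise
    return r                                       # row k is sent (encrypted, in the paper) to Compute node k

def dca_sum(Z, sigmas, M, seed=0):
    """Step 3: node k sums its received messages into q_k; adding the
    q_k recovers sum_i (z_i + eta_i). All parties are simulated here."""
    rng = np.random.default_rng(seed)
    q = np.zeros((M, Z.shape[1]))
    for z in Z:                                    # each client runs locally in a real deployment
        q += client_messages(z, sigmas, M, rng)    # node k accumulates sum_i m_{i,k}
    return q.sum(axis=0)                           # masks cancel (up to float rounding here)
```

Each message in isolation looks like random data, so no single Compute node learns a client's input; the paper's fixed-point arithmetic makes the mask cancellation exact, whereas the float sketch above cancels only up to rounding error.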
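
Algorithm 2's projection and DP sufficient statistics can be sketched in the same vein. In the paper the two sums are aggregated across clients with Algorithm 1; this illustration adds the equivalent aggregate Gaussian noise centrally, and assumes the noise scales `sigma_xx` and `sigma_xy` have already been derived from the bounds and the $(\epsilon, \delta)$ budget.

```python
import numpy as np

def project(V, c):
    """Project (clip) each column j of V to the bounds (-c_j, c_j)."""
    return np.clip(V, -c, c)

def dp_blr_stats(X, y, c_x, c_y, sigma_xx, sigma_xy, seed=0):
    """DP sufficient statistics of the projected data (step 4 of
    Algorithm 2), with noise drawn only for the unique entries of the
    symmetric matrix."""
    rng = np.random.default_rng(seed)
    Xp = project(X, c_x)                           # clip features to (-c_j, c_j)
    yp = np.clip(y, -c_y, c_y)                     # clip targets to (-c_{d+1}, c_{d+1})
    xx = Xp.T @ Xp                                 # sum_i x_i x_i^T
    xy = Xp.T @ yp                                 # sum_i x_i y_i
    d = X.shape[1]
    u = rng.normal(0.0, sigma_xx, size=(d, d))
    noise = np.triu(u) + np.triu(u, 1).T           # symmetric noise from d(d+1)/2 unique draws
    return xx + noise, xy + rng.normal(0.0, sigma_xy, size=d)
```

The noisy sums are exactly the sufficient statistics a conjugate Bayesian linear regression model needs, so the posterior can then be computed in closed form without touching the raw data again.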
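
Finally, the threshold search described in the Experiment Setup row can be reproduced schematically. The grid (20 points between 0.1 and 2.1 times the auxiliary-data std) and the repeat counts are from the paper; the joint search over the data/target threshold pair and the `evaluate` callback are assumptions made for illustration.

```python
import numpy as np
from itertools import product

# 20 grid points between 0.1 and 2.1 times the auxiliary-data std,
# as stated in the paper's experiment setup.
GRID = np.linspace(0.1, 2.1, 20)

def search_thresholds(evaluate, n_repeats):
    """Pick the best (data, target) threshold pair: one common threshold
    for all data dimensions, a separate one for the target. `evaluate`
    is a hypothetical callback that fits the DP model with the given
    thresholds and returns an error to minimise; n_repeats is 10 for
    GDSC and 20 for UCI in the paper."""
    best_score, best_pair = np.inf, None
    for p_x, p_y in product(GRID, GRID):           # joint grid search (an assumption)
        score = np.mean([evaluate(p_x, p_y, rep) for rep in range(n_repeats)])
        if score < best_score:
            best_score, best_pair = score, (p_x, p_y)
    return best_pair
```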