Differentially private Bayesian learning on distributed data
Authors: Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma, Antti Honkela
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the secure DP Bayesian learning scheme in practice by testing the performance of the BLR with data projection, the implementation of which was discussed in Section 3.2.1, along with the DCA (Algorithm 1) in the all-HbC-clients distributed setting (T = 0). We use simulated data for the DCA scalability testing, and real data for the BLR tests. As real data, we use the Wine Quality [6] (split into white and red wines) and Abalone data sets from the UCI repository [18], as well as the Genomics of Drug Sensitivity in Cancer (GDSC) project data². |
| Researcher Affiliation | Academia | Mikko Heikkilä¹ (mikko.a.heikkila@helsinki.fi), Eemil Lagerspetz² (eemil.lagerspetz@helsinki.fi), Samuel Kaski³ (samuel.kaski@aalto.fi), Kana Shimizu⁴ (shimizu.kana.g@gmail.com), Sasu Tarkoma² (sasu.tarkoma@helsinki.fi), Antti Honkela¹,⁵ (antti.honkela@helsinki.fi). ¹Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki; ²Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki; ³Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University; ⁴Department of Computer Science and Engineering, Waseda University; ⁵Department of Public Health, University of Helsinki |
| Pseudocode | Yes | Algorithm 1 (Distributed Compute Algorithm for distributed summation with independent Compute nodes). Input: $d$-dimensional vectors $z_i$ held by clients $i \in \{1, \dots, N\}$; distributed Gaussian mechanism noise variances $\sigma_j^2$, $j = 1, \dots, d$ (public); number of parties $N$ (public); number of Compute nodes $M$ (public). Output: differentially private sum $\sum_{i=1}^N (z_i + \eta_i)$, where $\eta_i \sim \mathcal{N}(0, \mathrm{diag}(\sigma_j^2))$. 1: Each client $i$ simulates $\eta_i \sim \mathcal{N}(0, \mathrm{diag}(\sigma_j^2))$ and $M-1$ vectors $r_{i,k}$ of uniformly random fixed-point data, with $r_{i,M} = -\sum_{k=1}^{M-1} r_{i,k}$ to ensure that $\sum_{k=1}^{M} r_{i,k} = 0_d$ (a vector of zeros). 2: Each client $i$ computes the messages $m_{i,1} = z_i + \eta_i + r_{i,1}$ and $m_{i,k} = r_{i,k}$, $k = 2, \dots, M$, and sends them securely to the corresponding Compute node $k$. 3: After receiving messages from all of the clients, Compute node $k$ decrypts the values and broadcasts the noisy aggregate sum $q_k = \sum_{i=1}^N m_{i,k}$. A final aggregator then adds these to obtain $\sum_{k=1}^M q_k = \sum_{i=1}^N (z_i + \eta_i)$. Algorithm 2 (Distributed linear regression with projection). Input: data and target values $(x_{ij}, y_i)$, $j = 1, \dots, d$, held by clients $i \in \{1, \dots, N\}$; number of clients $N$ (public); assumed data and target bounds $(-c_j, c_j)$, $j = 1, \dots, d+1$ (public); privacy budget $(\epsilon, \delta)$ (public). Output: DP BLR model sufficient statistics of the projected data, $\sum_{i=1}^N \hat{x}_i \hat{x}_i^T + \eta^{(1)}$ and $\sum_{i=1}^N \hat{x}_i^T \hat{y}_i + \eta^{(2)}$, calculated using projection to estimated optimal bounds. 1: Each client projects his data to the assumed bounds $(-c_j, c_j)$ $\forall j$. 2: Calculate marginal std estimates $\sigma^{(1)}, \dots, \sigma^{(d+1)}$ by running Algorithm 1 using the assumed bounds for sensitivity and a chosen share of the privacy budget. 3: Estimate optimal projection thresholds $p_j$, $j = 1, \dots, d+1$, as fractions of std on auxiliary data; each client then projects his data to the estimated optimal bounds $(-p_j \sigma^{(j)}, p_j \sigma^{(j)})$, $j = 1, \dots, d+1$. 4: Aggregate the unique terms in the DP sufficient statistics by running Algorithm 1 using the estimated optimal bounds for sensitivity and the remaining privacy budget, and combine the DP result vectors into the symmetric $d \times d$ matrix and $d$-dimensional vector of DP sufficient statistics. (Minimal Python sketches of both algorithms are given after the table.) |
| Open Source Code | Yes | The source code for our implementation is available through GitHub¹ and a more detailed description can be found in the Supplement. ¹https://github.com/DPBayes/dca-nips2017 |
| Open Datasets | Yes | As real data, we use the Wine Quality [6] (split into white and red wines) and Abalone data sets from the UCI repository [18], as well as the Genomics of Drug Sensitivity in Cancer (GDSC) project data². ... ²http://www.cancerrxgene.org/, release 6.1, March 2017 |
| Dataset Splits | Yes | For UCI, we compare the median performance measured on mean absolute error over 25 cross-validation (CV) runs, while for GDSC we measure mean prediction accuracy (sensitive vs. insensitive) with Spearman's rank correlation on 25 CV runs. |
| Hardware Specification | No | The paper mentions 'a modern CPU' for general timing estimation ('running AES for the data of the largest example would take less than 20 s on a single thread on a modern CPU') but does not provide specific hardware details (e.g., CPU model, GPU, RAM) used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'distributed Spark implementation' but does not provide specific version numbers for Spark or any other software dependencies. |
| Experiment Setup | Yes | The optimal projection thresholds are searched for using 10 (GDSC) or 20 (UCI) repeats on a grid with 20 points between 0.1 and 2.1 times the std of the auxiliary data set. In the search we use one common threshold for all data dimensions and a separate one for the target. (A schematic sketch of this grid search is given after the table.) |
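
A minimal NumPy sketch of Algorithm 1 (the DCA) may help make the message flow concrete. This is a single-process simulation under stated assumptions: floating-point masks stand in for the paper's fixed-point encoding, the encryption layer is omitted, and the function names and mask range are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def client_messages(z, sigmas, M, rng):
    """Steps 1-2 of Algorithm 1 for one client: the noisy value plus
    M masks that sum to the zero vector."""
    d = len(z)
    eta = rng.normal(0.0, sigmas)                  # eta_i ~ N(0, diag(sigma_j^2))
    r = rng.uniform(-1e6, 1e6, size=(M - 1, d))    # random masks; the range is an assumption
    r = np.vstack([r, -r.sum(axis=0)])             # r_{i,M} = -sum of the rest, so the masks cancel
    r[0] += z + eta                                # m_{i,1} = z_i + eta_i + r_{i,1}; m_{i,k} = r_{i,k} otherwise
    return r                                       # row k is sent (encrypted, in the paper) to Compute node k

def dca_sum(Z, sigmas, M, seed=0):
    """Step 3: node k sums its received messages into q_k; adding the
    q_k recovers sum_i (z_i + eta_i). All parties are simulated here."""
    rng = np.random.default_rng(seed)
    q = np.zeros((M, Z.shape[1]))
    for z in Z:                                    # each client runs locally in a real deployment
        q += client_messages(z, sigmas, M, rng)    # node k accumulates sum_i m_{i,k}
    return q.sum(axis=0)                           # masks cancel (up to float rounding here)
```

Each message in isolation looks like random data, so no single Compute node learns a client's input; the paper's fixed-point arithmetic makes the mask cancellation exact, whereas the float sketch above cancels only up to rounding error.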
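
Algorithm 2's projection and DP sufficient statistics can be sketched in the same vein. In the paper the two sums are aggregated across clients with Algorithm 1; this illustration adds the equivalent aggregate Gaussian noise centrally, and assumes the noise scales `sigma_xx` and `sigma_xy` have already been derived from the bounds and the $(\epsilon, \delta)$ budget.

```python
import numpy as np

def project(V, c):
    """Project (clip) each column j of V to the bounds (-c_j, c_j)."""
    return np.clip(V, -c, c)

def dp_blr_stats(X, y, c_x, c_y, sigma_xx, sigma_xy, seed=0):
    """DP sufficient statistics of the projected data (step 4 of
    Algorithm 2), with noise drawn only for the unique entries of the
    symmetric matrix."""
    rng = np.random.default_rng(seed)
    Xp = project(X, c_x)                           # clip features to (-c_j, c_j)
    yp = np.clip(y, -c_y, c_y)                     # clip targets to (-c_{d+1}, c_{d+1})
    xx = Xp.T @ Xp                                 # sum_i x_i x_i^T
    xy = Xp.T @ yp                                 # sum_i x_i y_i
    d = X.shape[1]
    u = rng.normal(0.0, sigma_xx, size=(d, d))
    noise = np.triu(u) + np.triu(u, 1).T           # symmetric noise from d(d+1)/2 unique draws
    return xx + noise, xy + rng.normal(0.0, sigma_xy, size=d)
```

The noisy sums are exactly the sufficient statistics a conjugate Bayesian linear regression model needs, so the posterior can then be computed in closed form without touching the raw data again.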
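
Finally, the threshold search described in the Experiment Setup row can be reproduced schematically. The grid (20 points between 0.1 and 2.1 times the auxiliary-data std) and the repeat counts are from the paper; the joint search over the data/target threshold pair and the `evaluate` callback are assumptions made for illustration.

```python
import numpy as np
from itertools import product

# 20 grid points between 0.1 and 2.1 times the auxiliary-data std,
# as stated in the paper's experiment setup.
GRID = np.linspace(0.1, 2.1, 20)

def search_thresholds(evaluate, n_repeats):
    """Pick the best (data, target) threshold pair: one common threshold
    for all data dimensions, a separate one for the target. `evaluate`
    is a hypothetical callback that fits the DP model with the given
    thresholds and returns an error to minimise; n_repeats is 10 for
    GDSC and 20 for UCI in the paper."""
    best_score, best_pair = np.inf, None
    for p_x, p_y in product(GRID, GRID):           # joint grid search (an assumption)
        score = np.mean([evaluate(p_x, p_y, rep) for rep in range(n_repeats)])
        if score < best_score:
            best_score, best_pair = score, (p_x, p_y)
    return best_pair
```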