Incentives in Private Collaborative Machine Learning
Authors: Rachael Sim, Yehong Zhang, Nghia Hoang, Xinyi Xu, Bryan Kian Hsiang Low, Patrick Jaillet
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets. This section empirically evaluates the privacy-valuation and privacy-reward trade-offs (Sec. 6.1), reward control mechanisms (Sec. 6.2), and their relationship with the utility of the model rewards (Sec. 6.3). The time complexity of our scheme is analyzed in App. F and baseline methods are discussed in App. H.3. We consider Bayesian linear regression (BLR) with unknown variance on the Syn and Cali H datasets, and Bayesian logistic regression on the Diab dataset with 3 collaborating parties (see App. H.1 for details) and enforce (2, ϵi)-Rényi DP (a noise-calibration sketch for this guarantee follows the table). |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, National University of Singapore, Republic of Singapore; 2 Peng Cheng Laboratory, People's Republic of China; 3 School of Electrical Engineering and Computer Science, Washington State University, USA; 4 Dept. of Electrical Engineering and Computer Science, MIT, USA |
| Pseudocode | Yes | Algorithm 1: BLR Gibbs sampler [4] from the noise-aware posterior p(θ \| O_N = o_N) ∝ ∫ ∏_{i∈N} [p(o_i \| s_i) p(s_i \| θ)] p(θ) ds_1 ⋯ ds_n. The algorithm repeatedly samples the latent variables S_i, ω, and θ sequentially (the conjugate Gibbs structure is sketched after this table). Algorithm 2: an overview of our collaborative ML problem setup. The computational complexity is given in App. F. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. |
| Open Datasets | Yes | We consider Bayesian linear regression (BLR) with unknown variance on the Syn and Cali H datasets, and Bayesian logistic regression on the Diab dataset with 3 collaborating parties (see App. H.1 for details) and enforce (2, ϵi)-Rényi DP. For Californian Housing dataset (Cali H) [44],... For PIMA Indian Diabetes classification dataset (Diab) [50], ... (a loading sketch for Cali H follows the table) |
| Dataset Splits | Yes | We split the data into training and validation sets using an 80-20 split. There are 614 training data points. There are 35.6% and 31.8% of patients with diabetes in the training and validation sets, respectively. |
| Hardware Specification | Yes | The experiments are performed on a machine with Ubuntu 20.04 LTS, 2 Intel Xeon Gold 6230 (2.1GHz) without GPU. |
| Software Dependencies | No | The software environments used are Miniconda and Python. A full list of packages used is given in the file environment.yml attached. |
| Experiment Setup | Yes | The normal inverse-gamma distribution used (i) to generate the true regression model weights, variance, and a 2D dataset and (ii) as our model prior is as follows: σ² ∼ Inv-Gamma(α0 = 5, β0 = 0.1), where α0 and β0 are, respectively, the inverse-gamma shape and scale parameters, and w \| σ² ∼ N(0, σ² Λ0⁻¹), where Λ0 = 0.025 I. We consider three parties 1, 2, and 3 with c0 = 100, c1 = 200, and c2 = 400 data points, respectively. We fix ϵ1 = ϵ3 = 0.2 and vary ϵ2 from the default 0.1. One posterior sampling run generates 16 Gibbs sampling chains in parallel. For each chain, we discard the first 10000 burn-in samples and draw m = 30000 samples. To reduce the closeness/correlation between samples, which would affect the nearest-neighbor-based KL estimation, we thin, keeping only every 16th sample, and concatenate the thinned samples across all 16 chains (this post-processing is sketched after the table). |
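One common way to enforce the (2, ϵi)-Rényi DP guarantee quoted above is the Gaussian mechanism, whose (α, ϵ)-RDP cost for an L2 sensitivity Δ is ϵ = αΔ²/(2σ²) (Mironov, 2017), so the noise scale can be calibrated as σ = Δ·√(α/(2ϵ)). The paper is not quoted on this calibration step, so the sketch below is illustrative; the unit sensitivity and the helper name are our own assumptions.

```python
import math

def gaussian_rdp_sigma(sensitivity: float, alpha: float, epsilon: float) -> float:
    """Noise std for the Gaussian mechanism to satisfy (alpha, epsilon)-Renyi DP,
    using the standard bound epsilon = alpha * Delta^2 / (2 * sigma^2)."""
    return sensitivity * math.sqrt(alpha / (2.0 * epsilon))

# Example: the paper's alpha = 2 with party 2's default budget eps_2 = 0.1,
# assuming (hypothetically) unit L2 sensitivity of the released statistic.
sigma = gaussian_rdp_sigma(sensitivity=1.0, alpha=2.0, epsilon=0.1)
print(f"noise std = {sigma:.3f}")  # sqrt(2 / 0.2) ~ 3.162
```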
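The Pseudocode row describes a noise-aware Gibbs sampler (Algorithm 1) that alternately samples the latent variables S_i, ω, and θ. As a hedged illustration of the conjugate backbone only (it omits the paper's privacy-noise likelihood p(o_i | s_i) and the latent S_i updates), here is a minimal Gibbs sampler for BLR under the normal inverse-gamma prior given in the Experiment Setup row; the function and variable names are our own.

```python
import numpy as np

def blr_gibbs(X, y, n_samples=30_000, burn_in=10_000,
              alpha0=5.0, beta0=0.1, lambda0_scale=0.025, seed=None):
    """Gibbs sampler for BLR with the normal inverse-gamma prior
    sigma^2 ~ Inv-Gamma(alpha0, beta0), w | sigma^2 ~ N(0, sigma^2 Lambda0^-1),
    Lambda0 = lambda0_scale * I (values from the Experiment Setup row)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Lambda0 = lambda0_scale * np.eye(d)
    Lambda_n = Lambda0 + X.T @ X                # posterior precision (up to sigma^2)
    mu_n = np.linalg.solve(Lambda_n, X.T @ y)   # posterior mean of w (zero prior mean)
    L = np.linalg.cholesky(np.linalg.inv(Lambda_n))

    w = np.zeros(d)
    draws = np.empty((n_samples, d + 1))
    for t in range(burn_in + n_samples):
        # sigma^2 | w, y ~ Inv-Gamma(alpha0 + (n + d)/2,
        #                            beta0 + ||y - Xw||^2 / 2 + w' Lambda0 w / 2)
        resid = y - X @ w
        shape = alpha0 + 0.5 * (n + d)
        scale = beta0 + 0.5 * (resid @ resid) + 0.5 * (w @ Lambda0 @ w)
        sigma2 = scale / rng.gamma(shape)       # inverse-gamma draw via 1/Gamma
        # w | sigma^2, y ~ N(mu_n, sigma^2 Lambda_n^-1)
        w = mu_n + np.sqrt(sigma2) * (L @ rng.standard_normal(d))
        if t >= burn_in:
            draws[t - burn_in] = np.append(w, sigma2)
    return draws
```

The paper's Algorithm 1 extends this structure: each party's exact sufficient statistic S_i becomes a latent variable inferred from its noisy release o_i, so an S_i-update step is interleaved with the ω and θ updates.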
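For reproduction, the California Housing data cited as Cali H [44] is available through scikit-learn's standard loader; the paper is not quoted on how it obtained the data, so this is an assumption, and the 80-20 split shown simply mirrors the Dataset Splits row (reported there for Diab).

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Standard scikit-learn loader for the California Housing data ("Cali H").
X, y = fetch_california_housing(return_X_y=True)
# 80-20 train/validation split, mirroring the Dataset Splits row.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_val.shape)
```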
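Finally, the chain post-processing in the Experiment Setup row (discard the first 10000 burn-in draws, keep every 16th sample, then pool across the 16 parallel chains) reduces between-sample correlation before the nearest-neighbor-based KL estimation. A minimal sketch, assuming the chains are stacked into a (n_chains, n_draws, dim) array, which is our own layout choice:

```python
import numpy as np

def thin_and_pool(chains, burn_in=10_000, thin=16):
    """chains: (n_chains, n_draws, dim). Drop burn-in, keep every
    `thin`-th draw per chain, then pool the thinned draws across chains."""
    kept = chains[:, burn_in::thin, :]           # (n_chains, n_kept, dim)
    return kept.reshape(-1, chains.shape[-1])    # pooled samples for KL estimation

# Example with the reported settings: 16 chains of 10000 + 30000 draws each.
chains = np.random.default_rng(0).standard_normal((16, 40_000, 3))
pooled = thin_and_pool(chains)
print(pooled.shape)  # (30000, 3): 16 chains x 1875 thinned draws each
```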