Incentives in Private Collaborative Machine Learning

Authors: Rachael Sim, Yehong Zhang, Nghia Hoang, Xinyi Xu, Bryan Kian Hsiang Low, Patrick Jaillet

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets. This section empirically evaluates the privacy-valuation and privacy-reward trade-offs (Sec. 6.1), reward control mechanisms (Sec. 6.2), and their relationship with the utility of the model rewards (Sec. 6.3). The time complexity of our scheme is analyzed in App. F and baseline methods are discussed in App. H.3. We consider Bayesian linear regression (BLR) with unknown variance on the Syn and Cali H datasets, and Bayesian logistic regression on the Diab dataset with 3 collaborating parties (see App. H.1 for details) and enforce (2, ϵi)-Rényi DP. (See the noise-calibration sketch after the table.)
Researcher Affiliation | Collaboration | 1 Department of Computer Science, National University of Singapore, Republic of Singapore; 2 Peng Cheng Laboratory, People's Republic of China; 3 School of Electrical Engineering and Computer Science, Washington State University, USA; 4 Dept. of Electrical Engineering and Computer Science, MIT, USA
Pseudocode | Yes | Algorithm 1: BLR Gibbs sampler [4] from the noise-aware posterior p(θ | O_N = o_N) ∝ ∫ [∏_{i∈N} p(o_i | s_i) p(s_i | θ)] p(θ) ds_1 ⋯ ds_n. The algorithm repeatedly samples the latent variables S_i, ω, and θ in sequence. Algorithm 2: An overview of our collaborative ML problem setup. The computational complexity is given in App. F. (A simplified toy analogue of the noise-aware Gibbs loop follows the table.)
Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology.
Open Datasets | Yes | We consider Bayesian linear regression (BLR) with unknown variance on the Syn and Cali H datasets, and Bayesian logistic regression on the Diab dataset with 3 collaborating parties (see App. H.1 for details) and enforce (2, ϵi)-Rényi DP. For the Californian Housing dataset (Cali H) [44], ... For the PIMA Indian Diabetes classification dataset (Diab) [50], ... (A sketch of loading these public datasets follows the table.)
Dataset Splits | Yes | We split the data into training and validation sets using an 80-20 split. There are 614 training data points; 35.6% and 31.8% of patients in the training and validation sets, respectively, have diabetes. (A sketch of reproducing such a split follows the table.)
Hardware Specification | Yes | The experiments are performed on a machine with Ubuntu 20.04 LTS and 2 Intel Xeon Gold 6230 CPUs (2.1 GHz), without a GPU.
Software Dependencies | No | The software environments used are Miniconda and Python. A full list of packages used is given in the attached environment.yml file.
Experiment Setup | Yes | The normal inverse-gamma distribution used (i) to generate the true regression model weights, variance, and a 2D dataset and (ii) as our model prior is as follows: σ² ~ Inv-Gamma(α0 = 5, β0 = 0.1), where α0 and β0 are, respectively, the inverse-gamma shape and scale parameters, and w | σ² ~ N(0, σ²Λ0⁻¹), where Λ0 = 0.025 I. We consider three parties 1, 2, and 3 with c1 = 100, c2 = 200, and c3 = 400 data points, respectively. We fix ϵ1 = ϵ3 = 0.2 and vary ϵ2 from the default 0.1. One posterior sampling run generates 16 Gibbs sampling chains in parallel. For each chain, we discard the first 10000 burn-in samples and draw m = 30000 samples. To reduce the closeness/correlation between samples, which would bias the nearest-neighbor-based KL estimation, we thin the chains, keeping only every 16-th sample, and concatenate the thinned samples across all 16 chains. (A sketch of sampling from this prior and thinning the chains follows the table.)
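For context on the (2, ϵi)-Rényi DP guarantee quoted in the Research Type row, here is a minimal sketch of calibrating Gaussian noise for an (α, ϵ)-Rényi DP release. The sensitivity value and the placeholder statistic are illustrative assumptions, not values from the paper; only the standard Gaussian-mechanism bound ϵ = αΔ²/(2σ²) is used.

```python
import numpy as np

def gaussian_noise_scale(sensitivity: float, alpha: float, epsilon: float) -> float:
    """Std. dev. sigma such that the Gaussian mechanism satisfies
    (alpha, epsilon)-Renyi DP: epsilon = alpha * sensitivity**2 / (2 * sigma**2)."""
    return sensitivity * np.sqrt(alpha / (2.0 * epsilon))

# Hypothetical release of a party's sufficient statistic under (2, 0.1)-RDP,
# assuming L2-sensitivity 1 (both numbers are placeholders).
rng = np.random.default_rng(0)
true_stat = np.ones(3)  # placeholder statistic
sigma = gaussian_noise_scale(1.0, alpha=2.0, epsilon=0.1)
noisy_stat = true_stat + rng.normal(0.0, sigma, size=true_stat.shape)
```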
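Algorithm 1's noise-aware Gibbs structure alternates between sampling the latent (un-noised) statistics S_i given θ and sampling θ given the S_i. The sketch below is a deliberately simplified scalar-Gaussian analogue of that loop, not the paper's BLR-with-unknown-variance sampler (which additionally samples an auxiliary variable ω); all variances and observations here are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy analogue: party i's latent statistic s_i ~ N(theta, tau2) is released
# as o_i = s_i + N(0, sigma_dp2) after DP noising. Gibbs-sample p(theta, s | o).
tau2, sigma_dp2 = 1.0, 0.5     # assumed model / DP noise variances
o = np.array([0.8, 1.1, 1.4])  # placeholder noisy observations
n = len(o)

theta, draws = 0.0, []
for _ in range(40_000):
    # s_i | theta, o_i: Gaussian with precision 1/tau2 + 1/sigma_dp2
    prec_s = 1.0 / tau2 + 1.0 / sigma_dp2
    s = rng.normal((theta / tau2 + o / sigma_dp2) / prec_s, np.sqrt(1.0 / prec_s))
    # theta | s, under a N(0, 1) prior on theta
    prec_t = 1.0 + n / tau2
    theta = rng.normal(s.sum() / tau2 / prec_t, np.sqrt(1.0 / prec_t))
    draws.append(theta)

posterior_draws = np.array(draws[10_000:])  # drop burn-in, as in the paper
```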
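Both real-world datasets named in the Open Datasets row are public. A sketch of loading them: the California Housing loader ships with scikit-learn, while the PIMA Diabetes file name and the "Outcome" label column are assumptions about the common UCI/Kaggle CSV release, not details stated in the paper.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Californian Housing (Cali H), bundled with scikit-learn.
cali = fetch_california_housing(as_frame=True)
X_cali, y_cali = cali.data, cali.target

# PIMA Indian Diabetes (Diab); file name and label column are assumed.
diab = pd.read_csv("pima-indians-diabetes.csv")
X_diab, y_diab = diab.drop(columns=["Outcome"]), diab["Outcome"]
```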
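The 80-20 train/validation split (614 training points on Diab) can be reproduced along these lines, continuing from the loading sketch above. The differing class proportions quoted (35.6% vs. 31.8%) suggest a plain random rather than stratified split; the seed below is arbitrary, and the paper does not state which tool was used.

```python
from sklearn.model_selection import train_test_split

# 80-20 split; pass stratify=y_diab instead to force matched class ratios.
X_train, X_val, y_train, y_val = train_test_split(
    X_diab, y_diab, test_size=0.2, random_state=0
)
print(len(X_train), y_train.mean(), y_val.mean())  # 614 points; class ratios
```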
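Finally, a sketch of the synthetic-data generation and chain post-processing described in the Experiment Setup row: drawing (σ², w) from the stated normal inverse-gamma prior and thinning the Gibbs chains. The 2D design matrix below is an assumption (the quote does not say how inputs are drawn), and the inverse-gamma draw uses the standard identity 1/Gamma(α0, rate=β0) ~ Inv-Gamma(α0, β0).

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, beta0 = 5.0, 0.1
Lambda0 = 0.025 * np.eye(2)

# sigma^2 ~ Inv-Gamma(alpha0, beta0): invert a Gamma(alpha0, scale=1/beta0) draw.
sigma2 = 1.0 / rng.gamma(shape=alpha0, scale=1.0 / beta0)
# w | sigma^2 ~ N(0, sigma^2 * Lambda0^{-1})
w = rng.multivariate_normal(np.zeros(2), sigma2 * np.linalg.inv(Lambda0))

# Placeholder 2D inputs (700 = 100 + 200 + 400 points across the parties);
# targets follow the BLR likelihood.
X = rng.normal(size=(700, 2))
y = X @ w + rng.normal(0.0, np.sqrt(sigma2), size=700)

# Post-processing of 16 chains of 10000 burn-in + 30000 kept draws:
chains = [rng.normal(size=(40_000, 3)) for _ in range(16)]  # stand-in chains
thinned = np.concatenate([c[10_000::16] for c in chains])   # every 16th sample
```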