Differentially Private Distributed Data Summarization under Covariate Shift

Authors: Kanthi Sarpatwar, Karthikeyan Shanmugam, Venkata Sitaramagiridharganesh Ganapavarapu, Ashish Jagmohan, Roman Vaculin

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Apart from theoretical guarantees, we demonstrate the efficacy of our protocol using real-world datasets. We show through empirical evaluation that this is indeed the case; by examining generalization error on two example tasks, we show that the protocol pays only a small price for its differential privacy guarantees. Section 4, Experimental Evaluation (Experiments on Real World Datasets): We now back our theoretical results with empirical experiments. We compare three algorithms: a) Non-Private Greedy, where the aggregator broadcasts the (exact) average of the hashed summary set (i.e., W(D_{s,i})_q) and hashed validation set (i.e., W(D_{v,i})_m). This is equivalent to the approach of Kim et al. (2016). b) Private Greedy, which is Algorithm 3 with parameters set as above. c) Uniform Sampling, where we draw an equal number p/K of the required samples from each data provider to construct a summary of size p. We empirically show that Private Greedy closely matches the performance of Non-Private Greedy even under strong differential privacy constraints. For comparison, we show that our algorithm outperforms Uniform Sampling. The motivation for choosing the latter as a candidate comes from the typical manner of using stochastic gradient descent approaches, such as Federated Learning [McMahan et al. (2016)], that perform uniform sampling. We experiment with two real-world datasets.
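As a point of reference, the Uniform Sampling baseline quoted above can be sketched in a few lines: draw an equal share of p/K points from each of the K providers and pool the draws into a summary of size (at most) p. The function name, NumPy usage, and seeding below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def uniform_sampling_summary(provider_datasets, p, seed=0):
    """Illustrative sketch of the Uniform Sampling baseline:
    draw ~p/K points uniformly without replacement from each of
    the K providers and pool them into a single summary."""
    rng = np.random.default_rng(seed)
    share = p // len(provider_datasets)  # equal share per provider
    parts = []
    for X in provider_datasets:
        idx = rng.choice(len(X), size=min(share, len(X)), replace=False)
        parts.append(X[idx])
    return np.vstack(parts)
```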
Researcher Affiliation | Industry | Kanthi K. Sarpatwar (IBM Research, sarpatwa@us.ibm.com); Karthikeyan Shanmugam (IBM Research AI, karthikeyan.shanmugam2@ibm.com); Venkata Sitaramagiridharganesh Ganapavarapu (IBM Research, giridhar.ganapavarapu@ibm.com); Ashish Jagmohan (IBM Research, ashishja@us.ibm.com); Roman Vaculin (IBM Research, vaculin@us.ibm.com)
Pseudocode | Yes | Algorithm 1: Computing the hash function h1(·); Algorithm 2: Computing the hash function h2(·); Algorithm 3: Description of the protocol.
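The Experiment Setup row below reports an RBF kernel parameter γ = 0.1 and a hash dimension d = 140 for h1(·), which points to a Rahimi-Recht random-feature map. The following is a minimal sketch of that standard construction, consistent with those values; it is an assumption for illustration and is not claimed to match the paper's Algorithm 1 line for line.

```python
import numpy as np

def rahimi_recht_features(X, d=140, gamma=0.1, seed=0):
    """Random Fourier feature map approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2) (standard Rahimi-Recht
    construction; illustrative, not the paper's exact Algorithm 1).

    X: (n, dim) data matrix; returns an (n, d) matrix phi(X) with
    phi(x) . phi(y) ~= k(x, y)."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    # Frequencies drawn from the kernel's Fourier transform:
    # omega ~ N(0, 2*gamma*I) for k(x, y) = exp(-gamma * ||x - y||^2).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(dim, d))
    b = rng.uniform(0, 2 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(X @ W + b)
```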
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We experiment with two real-world datasets. We discuss one of them, which is based on an Allstate insurance dataset from a Kaggle (2014) competition. We show similar results for the MNIST dataset, which contains image data for recognizing handwritten digits, in Appendix G.
Dataset Splits | Yes | Validation data: From the remaining 30% of the Connecticut data, we choose 25% as the validation data set.
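A minimal reconstruction of the quoted split is sketched below, assuming an unstated 70/30 first-stage partition (the quote only says "the remaining 30%") and using scikit-learn's train_test_split; variable names and the seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_connecticut(data, seed=0):
    """Assumed reconstruction of the quoted split: 70% of the
    Connecticut records go elsewhere (e.g., to providers); from the
    remaining 30%, a quarter is held out as the validation set."""
    providers, remainder = train_test_split(
        data, train_size=0.70, random_state=seed)
    rest, validation = train_test_split(
        remainder, test_size=0.25, random_state=seed)
    return providers, rest, validation

# Example with dummy rows standing in for the Allstate Connecticut data.
providers, rest, validation = split_connecticut(np.random.rand(1000, 5))
```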
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU/CPU models, memory specifications, or cloud instance types, used for running the experiments.
Software Dependencies | No | The paper mentions software such as linear SVMs, XGBoost, and decision trees but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We set the parameters of our algorithm as follows: the RBF kernel parameter γ = 0.1, and the dimension of the Rahimi-Recht hash function h1(·) is d = 140. We use two different T parameters for different epochs, given by T_init (= T_ℓ for ℓ = 1) = ⌊d^1.5⌋ = 1656 and T_subs (= T_ℓ for ℓ > 1) = 5. ε_v = 0.01 is the parameter of h2(·) for the validation set, and ε_{ℓ,T} for h2(·) on the summaries D_s is set over epochs to 0.05 for ℓ = 1 and 0.01√(p · T_subs) for ℓ > 1.
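For concreteness, the stated hyperparameters can be collected into a single configuration; the dict layout and key names are ours, not the paper's. The assertion checks the arithmetic behind T_init: ⌊140^1.5⌋ = ⌊1656.5...⌋ = 1656.

```python
import math

d = 140  # Rahimi-Recht feature dimension (stated in the paper)
config = {
    "gamma": 0.1,                    # RBF kernel parameter
    "d": d,
    "T_init": math.floor(d ** 1.5),  # T for epoch l = 1
    "T_subs": 5,                     # T for epochs l > 1
    "eps_validation": 0.01,          # h2 parameter for the validation set
    "eps_summary_first": 0.05,       # h2 parameter on summaries, l = 1
}

assert config["T_init"] == 1656  # floor(140^1.5) = floor(1656.5...)
```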