Label differential privacy and private training data release
Authors: Robert Istvan Busa-Fekete, Andres Munoz Medina, Umar Syed, Sergei Vassilvitskii
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our theoretical analysis. In this section we validate the theoretical findings in our paper by conducting simulations on real-world datasets. |
| Researcher Affiliation | Industry | Google Research. Correspondence to: All authors <{busarobi,ammedina,usyed,sergeiv}@google.com>. |
| Pseudocode | Yes | Mechanism 1: Model-based randomized response on label. Mechanism 2: Model-based double randomized response. Mechanism 3: Uniform randomized response on label (and output dummy features). (See the randomized-response sketch after the table.) |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | The MNIST dataset (Deng, 2012). We use four large-scale binary classification datasets, as described in Table 1. Table 1. The main parameters of the benchmark datasets. The kag14 dataset was used in the Kaggle Display Advertising Challenge and was released by Criteo (Criteo, 2014). The kdd12 dataset is the official dataset of KDD Cup 2012 Track 1 (...). The kdd10 dataset is the official dataset of KDD Cup 2010 (Stamper and Koedinger, 2010). SUSY is taken from the UCI repository. |
| Dataset Splits | Yes | Table 1. The main parameters of the benchmark datasets. kag14: #Train 40M, #Test 5.8M, ... For each mechanism $M$ defined above let $(\tilde{x}_i, \tilde{y}_i) = M(x_i, y_i)$ and let $\tilde{D} = ((\tilde{x}_i, \tilde{y}_i))_{i=1}^{n}$ denote the dataset released by the mechanism. Let $h: X \to Y$ denote a model trained on $\tilde{D}$. We measure the label inference accuracy as $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[h(\tilde{x}_i) = y_i]$. To understand the utility of the mechanism, we measure the ability of the learned model $h$ to predict on a test sample $S_T \subseteq X \times Y$. For this we use the testing data of MNIST and measure the test accuracy as $\frac{1}{n}\sum_{(x,y) \in S_T} \mathbb{1}[h(x) = y]$. (See the accuracy-metric sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions using a "conditional GAN" and a "standard convolutional network" but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train a conditional GAN (Mirza and Osindero, 2014) model on the training dataset. We vary $\lambda \in (0, 0.4]$ and generate private datasets using randomized response and double randomized response with parameter $\lambda$. For each dataset we computed an approximate k-nearest-neighbor graph under the $L_2$ distance, using $k = 1000$ for SUSY and $k = 10000$ for the rest. (See the k-NN graph sketch after the table.) |
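
The pseudocode row names three label-randomization mechanisms. For context, below is a minimal sketch of standard k-ary randomized response on labels, the primitive underlying Mechanism 3. The function name and signature are illustrative; the sketch omits the model-based conditioning of Mechanisms 1 and 2 and the dummy-feature output the paper mentions.

```python
import numpy as np

def uniform_randomized_response(labels, num_classes, epsilon, rng=None):
    """Release labels under epsilon-label-DP via k-ary randomized response.

    Each true label is kept with probability e^eps / (e^eps + K - 1);
    otherwise it is replaced by a label drawn uniformly from the other
    K - 1 classes. (Illustrative sketch, not the paper's exact mechanism.)
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    keep = rng.random(labels.shape) < p_keep
    # Draw a uniformly random *different* label for the flipped entries:
    # sample from [0, K-2] and shift past the true label where needed.
    noise = rng.integers(0, num_classes - 1, size=labels.shape)
    noise = noise + (noise >= labels)
    return np.where(keep, labels, noise)
```

As a sanity check, with epsilon = 1 and 10 classes the true label is kept with probability e/(e + 9), roughly 0.23.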
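The label inference accuracy and test accuracy formulas quoted in the Dataset Splits row translate directly into code. A minimal sketch, assuming a scikit-learn-style `model.predict` interface (an assumption on our part; the paper does not specify its model API):

```python
import numpy as np

def label_inference_accuracy(model, x_released, y_true):
    """(1/n) sum_i 1[h(x~_i) = y_i]: fraction of *original* labels
    recovered when the model is applied to the released features."""
    return np.mean(model.predict(x_released) == y_true)

def test_accuracy(model, x_test, y_test):
    """(1/n) sum_{(x,y) in S_T} 1[h(x) = y]: ordinary accuracy on a
    held-out test sample."""
    return np.mean(model.predict(x_test) == y_test)
```

Note the asymmetry, which mirrors the quoted formulas: the inference metric compares predictions on the released features against the original (pre-mechanism) labels, while the utility metric is plain held-out accuracy.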
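The experiment setup computes an approximate k-nearest-neighbor graph under the L2 distance. The paper does not name the library it used; as a small-scale stand-in, here is a sketch using scikit-learn's exact neighbor search (at the paper's 40M-point scale an approximate index such as ScaNN or FAISS would be needed instead):

```python
from sklearn.neighbors import NearestNeighbors

def knn_graph_l2(features, k):
    """Build a k-nearest-neighbor graph under the L2 distance.

    Returns a sparse adjacency matrix whose entry (i, j) is nonzero
    iff j is among the k nearest neighbors of i. When querying the
    same points used for fitting, each point counts itself as a
    neighbor (distance 0).
    """
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean")
    nn.fit(features)
    return nn.kneighbors_graph(features, mode="connectivity")

# Per the paper: k = 1000 for SUSY, k = 10000 for the other datasets.
```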