Label differential privacy and private training data release
Authors: Robert Istvan Busa-Fekete, Andres Munoz Medina, Umar Syed, Sergei Vassilvitskii
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our theoretical analysis. In this section we validate the theoretical findings in our paper by conducting simulations on real-world datasets. |
| Researcher Affiliation | Industry | Google Research. Correspondence to: All authors <{busarobi,ammedina,usyed,sergeiv}@google.com>. |
| Pseudocode | Yes | Mechanism 1: Model-based randomized response on label. Mechanism 2: Model-based double randomized response. Mechanism 3: Uniform randomized response on label (and output dummy features). (See the randomized-response sketch after the table.) |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | The MNIST dataset (Deng, 2012). We use four large-scale binary classification datasets, as described in Table 1. Table 1. The main parameters of the benchmark datasets. The kag14 dataset was used in the Kaggle Display Advertising Challenge and was released by Criteo (Criteo, 2014). The kdd12 dataset is the official dataset of KDD Cup 2012 Track 1 (...). The kdd10 dataset is the official dataset of KDD Cup 2010 (Stamper and Koedinger, 2010). SUSY is taken from the UCI repository. |
| Dataset Splits | Yes | Table 1. The main parameters of the benchmark datasets. kag14: #Train 40M, #Test 5.8M, ... For each mechanism $M$ defined above let $(\tilde{x}_i, \tilde{y}_i) = M(x_i, y_i)$ and let $\tilde{D} = ((\tilde{x}_i, \tilde{y}_i))_{i=1}^{n}$ denote the dataset released by the mechanism. Let $h: X \to Y$ denote a model trained on $\tilde{D}$. We measure the label inference accuracy as $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[h(\tilde{x}_i) = y_i]$. To understand the utility of the mechanism, we measure the ability of the learned model $h$ to predict on a test sample $S_T \subseteq X \times Y$. For this we use the testing data of MNIST and measure the test accuracy as $\frac{1}{n}\sum_{(x,y) \in S_T} \mathbb{1}[h(x) = y]$. (See the accuracy-metric sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions using a "conditional GAN" and a "standard convolutional network" but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train a conditional GAN (Mirza and Osindero, 2014) model on the training dataset. We vary $\lambda \in (0, 0.4]$ and generate private datasets using randomized response and double randomized response with parameter $\lambda$. For each dataset we computed an approximate k-nearest-neighbor graph under the $L_2$ distance, using $k = 1000$ for SUSY and $k = 10000$ for the rest. (See the k-NN graph sketch after the table.) |
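
The pseudocode row names three label-randomization mechanisms. For context, below is a minimal sketch of standard k-ary randomized response on labels, the primitive underlying Mechanism 3. The function name and signature are illustrative; the sketch omits the model-based conditioning of Mechanisms 1 and 2 and the dummy-feature output the paper mentions.

```python
import numpy as np

def uniform_randomized_response(labels, num_classes, epsilon, rng=None):
    """Release labels under epsilon-label-DP via k-ary randomized response.

    Each true label is kept with probability e^eps / (e^eps + K - 1);
    otherwise it is replaced by a label drawn uniformly from the other
    K - 1 classes. (Illustrative sketch, not the paper's exact mechanism.)
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    keep = rng.random(labels.shape) < p_keep
    # Draw a uniformly random *different* label for the flipped entries:
    # sample from [0, K-2] and shift past the true label where needed.
    noise = rng.integers(0, num_classes - 1, size=labels.shape)
    noise = noise + (noise >= labels)
    return np.where(keep, labels, noise)
```

As a sanity check, with epsilon = 1 and 10 classes the true label is kept with probability e/(e + 9), roughly 0.23.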
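The label inference accuracy and test accuracy formulas quoted in the Dataset Splits row translate directly into code. A minimal sketch, assuming a scikit-learn-style `model.predict` interface (an assumption on our part; the paper does not specify its model API):

```python
import numpy as np

def label_inference_accuracy(model, x_released, y_true):
    """(1/n) sum_i 1[h(x~_i) = y_i]: fraction of *original* labels
    recovered when the model is applied to the released features."""
    return np.mean(model.predict(x_released) == y_true)

def test_accuracy(model, x_test, y_test):
    """(1/n) sum_{(x,y) in S_T} 1[h(x) = y]: ordinary accuracy on a
    held-out test sample."""
    return np.mean(model.predict(x_test) == y_test)
```

Note the asymmetry, which mirrors the quoted formulas: the inference metric compares predictions on the released features against the original (pre-mechanism) labels, while the utility metric is plain held-out accuracy.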
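The experiment setup computes an approximate k-nearest-neighbor graph under the L2 distance. The paper does not name the library it used; as a small-scale stand-in, here is a sketch using scikit-learn's exact neighbor search (at the paper's 40M-point scale an approximate index such as ScaNN or FAISS would be needed instead):

```python
from sklearn.neighbors import NearestNeighbors

def knn_graph_l2(features, k):
    """Build a k-nearest-neighbor graph under the L2 distance.

    Returns a sparse adjacency matrix whose entry (i, j) is nonzero
    iff j is among the k nearest neighbors of i. When querying the
    same points used for fitting, each point counts itself as a
    neighbor (distance 0).
    """
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean")
    nn.fit(features)
    return nn.kneighbors_graph(features, mode="connectivity")

# Per the paper: k = 1000 for SUSY, k = 10000 for the other datasets.
```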