How Does Data Augmentation Affect Privacy in Machine Learning?
Authors: Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
AAAI 2021, pp. 10746-10753
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the proposed approach universally outperforms original methods when the model is trained with data augmentation. Even further, we show that the proposed approach can achieve higher MI attack success rates on models trained with some data augmentation than the existing methods on models trained without data augmentation. |
| Researcher Affiliation | Collaboration | Da Yu¹, Huishuai Zhang², Wei Chen², Jian Yin¹, Tie-Yan Liu² (¹School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, P.R. China; ²Microsoft Research Asia) |
| Pseudocode | Yes | Algorithm 1: Membership inference with average loss values (M_mean). Input: set of loss values ℓ_T(θ, d), threshold τ. Output: Boolean value; true denotes d is a member. Step 1: compute v = mean(ℓ_T). Step 2: return v < τ. (A runnable sketch of this rule appears after the table.) |
| Open Source Code | Yes | Our source code is publicly available at https://github.com/dayu11/MI_with_DA |
| Open Datasets | Yes | Datasets: We use benchmark datasets for image classification: CIFAR10, CIFAR100, and ImageNet1000. CIFAR10 and CIFAR100 both have 60000 examples, comprising 50000 training samples and 10000 test samples. CIFAR10 and CIFAR100 have 10 and 100 classes, respectively. ImageNet1000 contains more than one million high-resolution images with 1000 classes. We use the training and validation sets provided by ILSVRC2012. (A loading sketch for the CIFAR10 split appears after the table.) |
| Dataset Splits | Yes | The data are often divided into a training set Dtrain and a test set Dtest to properly evaluate model performance on unseen samples. The generalization gap G represents the difference in model performance between the test set and the training set: G = E_{d∼Dtest}[ℓ(θ, d)] − E_{d∼Dtrain}[ℓ(θ, d)]. (A sketch for estimating G appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper does not specify ancillary software dependencies with version numbers (e.g., library or solver versions) needed to replicate the experiments. |
| Experiment Setup | Yes | The small model is trained for 200 epochs with an initial learning rate of 0.01. We decay the learning rate by a factor of 10 at the 100-th epoch. (A sketch of this schedule appears after the table.) |
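Algorithm 1 is a simple threshold rule over aggregated losses. Below is a minimal NumPy sketch of that rule; the function name `mi_attack_mean` and the example loss values are illustrative assumptions, not taken from the authors' repository.

```python
import numpy as np

def mi_attack_mean(losses, tau):
    """Membership inference with average loss values (Algorithm 1, M_mean).

    losses: loss values ℓ_T(θ, d) of the target model on the (augmented)
            copies of a candidate example d.
    tau:    decision threshold.
    Returns True when d is predicted to be a training-set member
    (members tend to have lower loss than non-members).
    """
    v = np.mean(losses)  # Step 1: aggregate the per-copy losses.
    return v < tau       # Step 2: threshold decision.

# Illustrative usage with made-up loss values:
print(mi_attack_mean(np.array([0.02, 0.05, 0.01]), tau=0.1))  # True
```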
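The CIFAR10 split quoted above (50000 training / 10000 test images, 10 classes) is the standard one. The paper does not describe its data-loading code; the torchvision sketch below merely illustrates obtaining that split.

```python
import torchvision
import torchvision.transforms as T

# Standard CIFAR10 split: 50000 training and 10000 test images, 10 classes.
transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```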
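The generalization gap G = E_{d∼Dtest}[ℓ(θ, d)] − E_{d∼Dtrain}[ℓ(θ, d)] can be estimated directly from the two splits. A hedged PyTorch sketch follows; the cross-entropy loss and the loader names are assumptions for illustration, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_loss(model, loader, device="cpu"):
    """Average per-example cross-entropy loss ℓ(θ, d) over a data loader."""
    model.eval()
    total, count = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total += F.cross_entropy(model(x), y, reduction="sum").item()
        count += y.size(0)
    return total / count

# G = E_test[loss] - E_train[loss]; train_loader/test_loader are assumed:
# gap = mean_loss(model, test_loader) - mean_loss(model, train_loader)
```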
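The quoted setup (200 epochs, initial learning rate 0.01, decay by a factor of 10 at epoch 100) maps onto a standard step schedule. A minimal PyTorch sketch, assuming SGD and a placeholder model; the paper's actual training loop may differ.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder; the paper's "small model" differs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # initial lr 0.01
# Divide the learning rate by 10 at the 100-th epoch:
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100], gamma=0.1)

for epoch in range(200):  # 200 epochs in total
    # ... one epoch of training would go here ...
    scheduler.step()      # lr is 0.01 for epochs 0-99, then 0.001
```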