How Does Data Augmentation Affect Privacy in Machine Learning?
Authors: Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
AAAI 2021, pp. 10746-10753
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the proposed approach universally outperforms original methods when the model is trained with data augmentation. Even further, we show that the proposed approach can achieve higher MI attack success rates on models trained with some data augmentation than the existing methods on models trained without data augmentation. |
| Researcher Affiliation | Collaboration | Da Yu¹, Huishuai Zhang², Wei Chen², Jian Yin¹, Tie-Yan Liu² (¹School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, P.R. China; ²Microsoft Research Asia) |
| Pseudocode | Yes | Algorithm 1: Membership inference with average loss values (M_mean). Input: set of loss values ℓ_T(θ, d), threshold τ. Output: Boolean value; true denotes d is a member. Step 1: compute v = mean(ℓ_T). Step 2: return v < τ. (A runnable sketch of this rule appears after the table.) |
| Open Source Code | Yes | Our source code is publicly available at https://github.com/dayu11/MI_with_DA |
| Open Datasets | Yes | Datasets: We use benchmark datasets for image classification: CIFAR10, CIFAR100, and ImageNet1000. CIFAR10 and CIFAR100 both have 60000 examples, comprising 50000 training samples and 10000 test samples. CIFAR10 and CIFAR100 have 10 and 100 classes, respectively. ImageNet1000 contains more than one million high-resolution images with 1000 classes. We use the training and validation sets provided by ILSVRC2012. (A loading sketch for the CIFAR10 split appears after the table.) |
| Dataset Splits | Yes | The data are often divided into a training set Dtrain and a test set Dtest to properly evaluate model performance on unseen samples. The generalization gap G represents the difference in model performance between the test set and the training set: G = E_{d∼Dtest}[ℓ(θ, d)] − E_{d∼Dtrain}[ℓ(θ, d)]. (A sketch for estimating G appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper does not specify ancillary software dependencies with version numbers (e.g., library or solver versions) needed to replicate the experiments. |
| Experiment Setup | Yes | The small model is trained for 200 epochs with an initial learning rate of 0.01. We decay the learning rate by a factor of 10 at the 100-th epoch. (A sketch of this schedule appears after the table.) |
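Algorithm 1 is a simple threshold rule over aggregated losses. Below is a minimal NumPy sketch of that rule; the function name `mi_attack_mean` and the example loss values are illustrative assumptions, not taken from the authors' repository.

```python
import numpy as np

def mi_attack_mean(losses, tau):
    """Membership inference with average loss values (Algorithm 1, M_mean).

    losses: loss values ℓ_T(θ, d) of the target model on the (augmented)
            copies of a candidate example d.
    tau:    decision threshold.
    Returns True when d is predicted to be a training-set member
    (members tend to have lower loss than non-members).
    """
    v = np.mean(losses)  # Step 1: aggregate the per-copy losses.
    return v < tau       # Step 2: threshold decision.

# Illustrative usage with made-up loss values:
print(mi_attack_mean(np.array([0.02, 0.05, 0.01]), tau=0.1))  # True
```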
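The CIFAR10 split quoted above (50000 training / 10000 test images, 10 classes) is the standard one. The paper does not describe its data-loading code; the torchvision sketch below merely illustrates obtaining that split.

```python
import torchvision
import torchvision.transforms as T

# Standard CIFAR10 split: 50000 training and 10000 test images, 10 classes.
transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```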
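The generalization gap G = E_{d∼Dtest}[ℓ(θ, d)] − E_{d∼Dtrain}[ℓ(θ, d)] can be estimated directly from the two splits. A hedged PyTorch sketch follows; the cross-entropy loss and the loader names are assumptions for illustration, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_loss(model, loader, device="cpu"):
    """Average per-example cross-entropy loss ℓ(θ, d) over a data loader."""
    model.eval()
    total, count = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total += F.cross_entropy(model(x), y, reduction="sum").item()
        count += y.size(0)
    return total / count

# G = E_test[loss] - E_train[loss]; train_loader/test_loader are assumed:
# gap = mean_loss(model, test_loader) - mean_loss(model, train_loader)
```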
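The quoted setup (200 epochs, initial learning rate 0.01, decay by a factor of 10 at epoch 100) maps onto a standard step schedule. A minimal PyTorch sketch, assuming SGD and a placeholder model; the paper's actual training loop may differ.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder; the paper's "small model" differs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # initial lr 0.01
# Divide the learning rate by 10 at the 100-th epoch:
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100], gamma=0.1)

for epoch in range(200):  # 200 epochs in total
    # ... one epoch of training would go here ...
    scheduler.step()      # lr is 0.01 for epochs 0-99, then 0.001
```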