A Separation Result Between Data-oblivious and Data-aware Poisoning Attacks
Authors: Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Abhradeep Guha Thakurta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we initiate a theoretical study of the problem above. Specifically, for the case of feature selection with LASSO, we show that full-information adversaries (that craft poisoning examples based on the rest of the training data) are provably stronger than the optimal attacker that is oblivious to the training set yet has access to the distribution of the data. Our separation result shows that the two settings of data-aware and data-oblivious attacks are fundamentally different, and we cannot hope to always achieve the same attack or defense results in these scenarios. To further investigate the power of data-oblivious and data-aware attacks in the context of feature selection, we experiment on synthetic datasets sampled from Gaussian distributions, as suggested in our theoretical results. Our experiments confirm our theoretical findings by showing that the power of data-oblivious and data-aware poisoning attacks differs significantly. |
| Researcher Affiliation | Collaboration | Samuel Deng (Columbia University, samdeng@cs.columbia.edu); Sanjam Garg (UC Berkeley and NTT Research, sanjamg@berkeley.edu); Somesh Jha (University of Wisconsin, jha@cs.wisc.edu); Saeed Mahloujifar (Princeton, sfar@princeton.edu); Mohammad Mahmoody (University of Virginia, mohammad@virginia.edu); Abhradeep Thakurta (Google Research Brain Team, athakurta@google.com) |
| Pseudocode | No | The paper describes algorithms (Lasso estimator) and security games but does not provide any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We also consider MNIST and four other datasets used widely in the feature selection literature to explore this separation in real-world data: Boston, TOX, Prostate_GE, and SMK. TOX, SMK, and Prostate_GE can be found here: http://featureselection.asu.edu/datasets.php. Boston can be found with scikit-learn's built-in datasets: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html |
| Dataset Splits | No | The paper mentions training data but does not provide specific details on train/validation/test dataset splits, percentages, or methodology. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or dependencies (e.g., 'Python 3.x', 'PyTorch 1.x') used in the experiments. |
| Experiment Setup | Yes | For the LASSO algorithm, we use the hyperparameter λ = 2σ√(n log p). We first preprocess the data by standardizing to zero mean and unit variance. Then, we chose λ such that the resulting parameter vector θ̂ has a reasonable support size (at least 10 features in the support); this was done by searching over the space of λ/n ∈ [0, 1.0], and resulted in λ = 50.1 for Boston, λ = 9.35 for SMK, λ = 17 for TOX, λ = 5.1 for Prostate, and λ = 1000 for MNIST. |
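The standardize-then-search procedure in the Experiment Setup row can be sketched as below. This is not the authors' code; it is a minimal illustration on hypothetical synthetic Gaussian data, assuming scikit-learn's `Lasso`, whose `alpha` parameter corresponds to λ/n in the paper's notation (scikit-learn minimizes (1/(2n))·‖y − Xw‖² + alpha·‖w‖₁).

```python
# Hedged sketch of the reported lambda-selection procedure (not the paper's code).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic Gaussian data standing in for the paper's datasets.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:15] = 1.0                      # 15-sparse ground-truth parameter
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Preprocess: standardize features to zero mean and unit variance.
X = StandardScaler().fit_transform(X)

# Search alpha = lambda/n over (0, 1.0], from largest to smallest, keeping the
# first (i.e. largest) value whose fitted theta-hat has support size >= 10.
chosen, support = None, 0
for alpha in np.linspace(1.0, 0.01, 100):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    support = np.count_nonzero(model.coef_)
    if support >= 10:
        chosen = alpha
        break

print("chosen lambda/n:", chosen, "| support size:", support)
```

Searching from the largest penalty downward picks the sparsest model that still meets the support-size floor, which matches the reported goal of "at least 10 features in the support."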