A Separation Result Between Data-oblivious and Data-aware Poisoning Attacks
Authors: Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Abhradeep Guha Thakurta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we initiate a theoretical study of the problem above. Specifically, for the case of feature selection with LASSO, we show that full-information adversaries (that craft poisoning examples based on the rest of the training data) are provably stronger than the optimal attacker that is oblivious to the training set yet has access to the distribution of the data. Our separation result shows that the two settings of data-aware and data-oblivious attacks are fundamentally different, and we cannot hope to always achieve the same attack or defense results in these scenarios. To further investigate the power of data-oblivious and data-aware attacks in the context of feature selection, we experiment on synthetic datasets sampled from Gaussian distributions, as suggested in our theoretical results. Our experiments confirm our theoretical findings by showing that the power of data-oblivious and data-aware poisoning attacks differs significantly. |
| Researcher Affiliation | Collaboration | Samuel Deng (Columbia University, samdeng@cs.columbia.edu); Sanjam Garg (UC Berkeley and NTT Research, sanjamg@berkeley.edu); Somesh Jha (University of Wisconsin, jha@cs.wisc.edu); Saeed Mahloujifar (Princeton, sfar@princeton.edu); Mohammad Mahmoody (University of Virginia, mohammad@virginia.edu); Abhradeep Thakurta (Google Research Brain Team, athakurta@google.com) |
| Pseudocode | No | The paper describes algorithms (Lasso estimator) and security games but does not provide any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We also consider MNIST and four other datasets used widely in the feature selection literature to explore this separation in real-world data: Boston, TOX, Prostate_GE, and SMK. TOX, SMK, and Prostate_GE can be found here: http://featureselection.asu.edu/datasets.php. Boston can be found with scikit-learn's built-in datasets: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html |
| Dataset Splits | No | The paper mentions training data but does not provide specific details on train/validation/test dataset splits, percentages, or methodology. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or dependencies (e.g., 'Python 3.x', 'PyTorch 1.x') used in the experiments. |
| Experiment Setup | Yes | For the LASSO algorithm, we use the hyperparameter λ = 2σ√(n log p). We first preprocess the data by standardizing to zero mean and unit variance. Then, we chose λ such that the resulting parameter vector θ̂ has a reasonable support size (at least 10 features in the support); this was done by searching over the space of λ/n ∈ [0, 1.0], and resulted in λ = 50.1 for Boston, λ = 9.35 for SMK, λ = 17 for TOX, λ = 5.1 for Prostate, and λ = 1000 for MNIST. |
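The standardize-then-search procedure in the Experiment Setup row can be sketched as below. This is not the authors' code; it is a minimal illustration on hypothetical synthetic Gaussian data, assuming scikit-learn's `Lasso`, whose `alpha` parameter corresponds to λ/n in the paper's notation (scikit-learn minimizes (1/(2n))·‖y − Xw‖² + alpha·‖w‖₁).

```python
# Hedged sketch of the reported lambda-selection procedure (not the paper's code).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic Gaussian data standing in for the paper's datasets.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:15] = 1.0                      # 15-sparse ground-truth parameter
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Preprocess: standardize features to zero mean and unit variance.
X = StandardScaler().fit_transform(X)

# Search alpha = lambda/n over (0, 1.0], from largest to smallest, keeping the
# first (i.e. largest) value whose fitted theta-hat has support size >= 10.
chosen, support = None, 0
for alpha in np.linspace(1.0, 0.01, 100):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    support = np.count_nonzero(model.coef_)
    if support >= 10:
        chosen = alpha
        break

print("chosen lambda/n:", chosen, "| support size:", support)
```

Searching from the largest penalty downward picks the sparsest model that still meets the support-size floor, which matches the reported goal of "at least 10 features in the support."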