Transferable Perturbations of Deep Feature Distributions

Authors: Nathan Inkawhich, Kevin Liang, Lawrence Carin, Yiran Chen

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples.
Researcher Affiliation | Academia | Nathan Inkawhich, Kevin J. Liang, Lawrence Carin & Yiran Chen, Department of Electrical and Computer Engineering, Duke University, {nathan.inkawhich,kevin.liang,lcarin,yiran.chen}@duke.edu
Pseudocode | No | The paper describes the optimization procedure in text (Appendix D) but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include a statement about releasing the code for their method or a direct link to a code repository.
Open Datasets | Yes | The models are trained on ImageNet-1k (Deng et al., 2009).
Dataset Splits | Yes | We use the ImageNet-1K validation set as the test dataset.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using models from the PyTorch Model Zoo but does not specify version numbers for PyTorch, Python, or other software libraries used for the implementation.
Experiment Setup | Yes | The architecture of all auxiliary models is the same, regardless of model, layer, or class. Each is a 2-hidden-layer NN with a single output unit. There are 200 neurons in each hidden layer and the number of input units matches the size of the input feature map. To train the auxiliary models, unbiased batches from the whole ImageNet-1k training set are pushed through the truncated pretrained model (f_l), and the extracted features are used to train the auxiliary model parameters. ... All targeted adversarial examples are constrained by ℓ∞ ε = 16/255 as described in (Dong et al., 2018; Kurakin et al., 2018). As experimentally found, λ = 0.8 in (2) and η = 1e-6 in (3). Finally, as measured over the initially correctly classified subset of the test dataset (by both the whitebox and blackbox models), attack success is captured in two metrics: Error is the percentage of examples that the blackbox misclassifies, and Targeted Success Rate (tSuc) is the percentage of examples that the blackbox misclassifies as the target label.
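
To make the quoted setup concrete, the listing below gives a minimal PyTorch sketch of (i) the per-layer, per-class auxiliary model (two hidden layers of 200 units and a single output unit over the input feature map), (ii) an illustrative PGD-style loop that perturbs an input within the ℓ∞ ε = 16/255 ball to raise the auxiliary model's predicted target-class probability, and (iii) the Error and tSuc metrics. The names AuxiliaryModel, targeted_feature_attack, and attack_metrics, as well as the flattening of the feature map and the step schedule, are illustrative assumptions rather than code from the paper, and the attack loop omits the paper's λ- and η-weighted terms from Eqs. (2)-(3).

import torch
import torch.nn as nn


class AuxiliaryModel(nn.Module):
    """Per-(layer, class) auxiliary model: two hidden layers of 200 units and a
    single sigmoid output unit, applied to the flattened intermediate feature map."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                      # assumption: feature map is flattened
            nn.Linear(feature_dim, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 1),                 # single output unit
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Probability that the features belong to the modeled class.
        return torch.sigmoid(self.net(features))


def targeted_feature_attack(x, truncated_model, aux_model,
                            eps=16 / 255, step_size=2 / 255, steps=10):
    """Illustrative PGD-style loop only: perturb x within an l-infinity ball of
    radius eps so that the target-class auxiliary model assigns high probability
    to the intermediate features f_l(x + delta). The paper's full objective
    (Eqs. (2)-(3), with the lambda- and eta-weighted terms) is not reproduced here."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        prob = aux_model(truncated_model(x + delta))
        loss = -torch.log(prob + 1e-12).mean()  # maximize target-class probability
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)


def attack_metrics(blackbox_preds, true_labels, target_labels):
    """Error and targeted success rate (tSuc), computed over the subset of test
    examples initially classified correctly by both whitebox and blackbox models."""
    error = (blackbox_preds != true_labels).float().mean().item()
    t_suc = (blackbox_preds == target_labels).float().mean().item()
    return error, t_suc

In this sketch, one AuxiliaryModel would be trained per (layer, class) pair on features extracted from the truncated pretrained model, matching the training description quoted above; a hypothetical call such as adv = targeted_feature_attack(x, truncated_model, aux_models[(layer, target_class)]) then produces the constrained adversarial example.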