Proper Network Interpretability Helps Adversarial Robustness in Classification

Authors: Akhilan Boopathy, Sijia Liu, Gaoyuan Zhang, Cynthia Liu, Pin-Yu Chen, Shiyu Chang, Luca Daniel

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted ImageNet. We empirically show that interpretability alone can be used to defend against adversarial attacks for both misclassification and misinterpretation.
Researcher Affiliation | Collaboration | 1. Massachusetts Institute of Technology; 2. MIT-IBM Watson AI Lab, IBM Research.
Pseudocode | No | The paper describes mathematical formulations and methods but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Our codes are available at https://github.com/AkhilanB/Proper-Interpretability
Open Datasets | Yes | We evaluate networks trained on the MNIST and CIFAR-10 datasets, and a Restricted ImageNet (R-ImageNet) dataset used in (Tsipras et al., 2019).
Dataset Splits | No | The paper uses standard datasets (MNIST, CIFAR-10, and Restricted ImageNet) but does not explicitly provide specific percentages, sample counts, or detailed methodologies for their train/validation/test splits. While it mentions evaluating on '200 random test set points', this does not define the full data partitioning needed for reproduction.
Hardware Specification | Yes | Training times are evaluated on a 2.60 GHz Intel Xeon CPU.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries).
Experiment Setup | Yes | Unless specified otherwise, we choose the perturbation size ϵ = 0.3 on MNIST, 8/255 on CIFAR-10 and 0.003 for R-ImageNet for robust training under an ℓ∞ perturbation norm. Also, we set the regularization parameter γ as 0.01 in (8); see a justification in Appendix F. (A hedged configuration sketch illustrating these values follows the table.)
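
The Experiment Setup row quotes the paper's ℓ∞ budgets and regularization weight. The snippet below is a minimal, hedged sketch (not the authors' released code) of how those values would typically enter an ℓ∞-constrained robust-training pipeline: the dataset-to-ϵ mapping, the γ = 0.01 weight from Eq. (8), and the ℓ∞ projection step are the only pieces taken from the quoted setup; every function and variable name here is hypothetical.

```python
import numpy as np

# l_inf perturbation budgets quoted in the Experiment Setup row (assumption:
# pixel values are scaled to [0, 1], as is standard for these budgets).
EPSILONS = {"mnist": 0.3, "cifar10": 8 / 255, "r_imagenet": 0.003}
GAMMA = 0.01  # regularization weight gamma from Eq. (8) of the paper

def project_linf(x_adv, x_clean, eps):
    """Project a perturbed input back into the l_inf ball of radius eps
    around the clean input, then clip to the valid pixel range [0, 1]."""
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)
    return np.clip(x_adv, 0.0, 1.0)

# Usage example with the MNIST budget (placeholder data, no real model).
x_clean = np.random.rand(1, 28, 28)
x_adv = x_clean + np.random.uniform(-1.0, 1.0, size=x_clean.shape)
x_adv = project_linf(x_adv, x_clean, EPSILONS["mnist"])
assert np.abs(x_adv - x_clean).max() <= EPSILONS["mnist"] + 1e-8
```

The projection above is the standard clip-to-ball step used in PGD-style ℓ∞ robust training; the interpretation-discrepancy regularizer from Eq. (8), which GAMMA would weight in the training loss, is not reproduced here.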