Reinforcing Adversarial Robustness using Model Confidence Induced by Adversarial Training
Authors: Xi Wu, Uyeong Jang, Jiefeng Chen, Lingjiao Chen, Somesh Jha
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a detailed empirical study over CIFAR-10 for ℓ∞ attacks. We reuse the robust ResNet model trained by Madry et al. as the base model, and use HCNN_ξ. We modify state-of-the-art ℓ∞ attacks, such as the CW attack (Carlini & Wagner, 2017a) and the PGD attack (Madry et al., 2017), to exploit confidence information in order to break our method by generating high-confidence attacks. We first empirically validate that Madry et al.'s model is better, in view of our probabilistic separation property, than models trained without a robustness objective. We then evaluate using confidence to reject adversarial examples, and finally report end-to-end defense results. Our results are both encouraging and discouraging: for small radii, we find that confidence is indeed a good discriminator between right and wrong predictions, and it does improve adversarial robustness. (See the confidence-rejection sketch after the table.) |
| Researcher Affiliation | Collaboration | Xi Wu*1, Uyeong Jang*2, Jiefeng Chen2, Lingjiao Chen2, Somesh Jha2. 1Google; 2University of Wisconsin-Madison. Correspondence to: Xi Wu <xiwu@cs.wisc.edu>. |
| Pseudocode | Yes | Algorithm 1: Solving HCNN_ξ by solving for each label. Input: x a feature vector, ξ > 0 a real parameter, λ ≥ 0 a real parameter, a base model F, and any gradient-based optimization algorithm O to solve the constrained optimization problem defined in (5). 1: function OracleHCNN(x, ξ, F); 2: for l ∈ C do; 3: z(l) ← O(x, F, l); 4: return z(l*) where l* = arg max_{l∈C} { F(z(l))_l − λ‖z(l) − x‖ }. (See the Python sketch after the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a repository link or an explicit statement about code release. |
| Open Datasets | Yes | Overall Setup. We study the above questions using ℓ∞ attacks over CIFAR-10 (Krizhevsky, 2009). |
| Dataset Splits | No | The paper mentions using a 'test set' for evaluation but does not specify the train/validation/test dataset splits, percentages, or methodology needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions using attacks like CW and PGD but does not provide specific ancillary software details with version numbers, such as library or framework versions. |
| Experiment Setup | Yes | We use a strengthened version of the PGD attack from (Madry et al., 2017) (with ℓ∞ radius η, 10 random starts, and 100 iterations) to first generate, for each wrong label, an adversarial example whose model confidence is as large as possible. Our implementation of MCN_ξ solves (5) using the PGD attack with a different setting (ℓ∞ radius ξ, no random start, and 500 iterations). (See the PGD sketch after the table.) |
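
The "Research Type" row describes using model confidence to reject adversarial examples. Below is a minimal sketch of such a rejection rule, assuming a PyTorch classifier that outputs logits; `predict_or_reject` and the threshold value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def predict_or_reject(model, x, threshold=0.5):
    """Return (label, confidence), or (None, confidence) when the top
    softmax probability falls below `threshold` (i.e. reject the input).
    Hypothetical illustration of confidence-based rejection; the
    threshold value is arbitrary."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)    # x: single-example batch of shape (1, C, H, W)
        confidence, label = probs.max(dim=1)  # top-1 probability and predicted class
    if confidence.item() < threshold:
        return None, confidence.item()        # abstain on low-confidence inputs
    return label.item(), confidence.item()
```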
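
The "Pseudocode" row transcribes Algorithm 1. The following is a minimal reading of that algorithm in Python, assuming a caller-supplied gradient-based solver `solve_constrained` for the constrained problem (5); all names are illustrative and the inner solver is left abstract.

```python
import torch
import torch.nn.functional as F

def oracle_hcnn(model, x, xi, lam, labels, solve_constrained):
    """Sketch of Algorithm 1: for each candidate label l, call a
    gradient-based solver for the constrained problem (5) to obtain z(l),
    then return the z(l*) maximizing F(z(l))_l - lam * ||z(l) - x||."""
    best_z, best_score = None, float("-inf")
    for l in labels:
        z = solve_constrained(model, x, l, xi)          # z(l) <- O(x, F, l)
        with torch.no_grad():
            conf_l = F.softmax(model(z), dim=1)[0, l]   # F(z(l))_l
            score = conf_l - lam * (z - x).norm()       # confidence minus distance penalty
        if score.item() > best_score:
            best_score, best_z = score.item(), z
    return best_z
```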
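
The "Experiment Setup" row describes a strengthened PGD attack that pushes confidence on a wrong label as high as possible within an ℓ∞ ball, using random starts and a fixed iteration budget. Here is a rough sketch of such an attack, assuming a PyTorch model with inputs in [0, 1]; the confidence loss, step size, and bookkeeping are assumptions rather than the authors' exact attack.

```python
import torch
import torch.nn.functional as F

def pgd_high_confidence(model, x, target, eps, steps=100, restarts=10, step_size=None):
    """ell_inf PGD sketch: maximize the model's confidence on `target`
    inside an eps-ball around x, keeping the best of several random starts.
    Illustrative only; not the authors' strengthened attack."""
    step_size = step_size if step_size is not None else 2.5 * eps / steps
    best_adv, best_conf = x.clone(), -1.0
    for _ in range(restarts):
        delta = torch.empty_like(x).uniform_(-eps, eps)   # random start in the ball
        for _ in range(steps):
            delta.requires_grad_(True)
            conf = F.softmax(model(x + delta), dim=1)[0, target]
            grad = torch.autograd.grad(conf, delta)[0]
            with torch.no_grad():
                delta = (delta + step_size * grad.sign()).clamp(-eps, eps)  # ascend and project
                delta = (x + delta).clamp(0, 1) - x                         # keep valid pixels
        with torch.no_grad():
            conf = F.softmax(model(x + delta), dim=1)[0, target].item()
        if conf > best_conf:
            best_conf, best_adv = conf, (x + delta).detach()
    return best_adv, best_conf
```

With `restarts=10` and `steps=100` this mirrors the high-confidence generator setting quoted above; dropping the random start and raising the iteration count to 500 would correspond to the MCN_ξ setting, under the same caveat that this is only a sketch.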