Annealing Self-Distillation Rectification Improves Adversarial Training
Authors: Yu-Yu Wu, Hung-Jui Wang, Shang-Tse Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of ADR through extensive experiments and strong performances across datasets. In this section, we compare the proposed ADR to PGD-AT and TRADES in Table 1. We further investigate the efficacy of ADR in conjunction with model-weight-space smoothing techniques Weight Average (WA) (Izmailov et al., 2018; Gowal et al., 2020) and Adversarial Weight Perturbation (AWP) (Wu et al., 2020) with ResNet-18 (He et al., 2016a) in Table 2. Experiments are conducted on well-established benchmark datasets, including CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny ImageNet-200 (Le & Yang, 2015; Deng et al., 2009). |
| Researcher Affiliation | Academia | Yu-Yu Wu, Hung-Jui Wang, Shang-Tse Chen National Taiwan University {r10922018,r10922061,stchen}@csie.ntu.edu.tw |
| Pseudocode | Yes | Algorithm 1 Annealing Self-Distillation Rectification (ADR) |
| Open Source Code | Yes | Furthermore, the source code can be found in the supplementary materials to ensure the reproducibility of this project. |
| Open Datasets | Yes | Experiments are conducted on well-established benchmark datasets, including CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny ImageNet-200 (Le & Yang, 2015; Deng et al., 2009). |
| Dataset Splits | Yes | During training, we evaluate the model with PGD-10 and select the model that has the highest robust accuracy on the validation set with early stopping (Rice et al., 2020). |
| Hardware Specification | Yes | The experiment is reported by running each algorithm on a single NVIDIA RTX A6000 GPU with batch size 128. |
| Software Dependencies | No | The paper mentions software components like "SGD optimizer" and uses models like "ResNet-18" and "WRN-34-10", but it does not specify any version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key libraries. |
| Experiment Setup | Yes | We perform adversarial training with perturbation budget ϵ = 8/255 under the ℓ∞-norm in all experiments. In training, we use the 10-step PGD adversary with step size α = 2/255. We adopt β = 6 for TRADES as outlined in the original paper. The models are trained using the SGD optimizer with Nesterov momentum of 0.9, weight decay 0.0005, and a batch size of 128. The initial learning rate is set to 0.1 and divided by 10 at 50% and 75% of the total training epochs. Simple data augmentations, including 32×32 random crop with 4-pixel padding and random horizontal flip (Rice et al., 2020; Gowal et al., 2020; Pang et al., 2021), are applied in all experiments. Following Wu et al. (2020); Gowal et al. (2020), we choose radius 0.005 for AWP and decay rate γ = 0.995 for WA. For CIFAR-10/100, we use 200 total training epochs, λ follows cosine scheduling from 0.7 to 0.95, and τ is annealed with cosine decreasing from 2.5 to 2 on CIFAR-10 and 1.5 to 1 on CIFAR-100, respectively. As for Tiny ImageNet-200, we crop the image size to 64×64 and use 80 training epochs. We adjust λ from 0.5 to 0.9 and τ from 2 to 1.5 on this dataset. (A hedged sketch of these settings follows the table.) |
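
The quoted setup maps onto standard PyTorch components. Below is a minimal sketch of the CIFAR-10 configuration: only the numeric values (ε, PGD step size and steps, momentum, weight decay, learning-rate schedule, λ/τ ranges) come from the quoted text; the torchvision backbone, function names, and loop structure are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the CIFAR-10 training configuration quoted above.
# Numbers come from the table; everything else is an illustrative assumption.
import math
import torch
import torch.nn.functional as F
from torchvision.models import resnet18  # stand-in for the CIFAR-style ResNet-18

EPOCHS     = 200       # CIFAR-10/100 (80 for Tiny ImageNet-200)
BATCH_SIZE = 128
EPSILON    = 8 / 255   # l_inf perturbation budget
PGD_STEPS  = 10        # 10-step PGD adversary during training
PGD_ALPHA  = 2 / 255   # PGD step size

model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, nesterov=True, weight_decay=5e-4
)
# Initial LR 0.1, divided by 10 at 50% and 75% of the total training epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[EPOCHS // 2, int(EPOCHS * 0.75)], gamma=0.1
)

def cosine_anneal(start, end, epoch, total_epochs):
    """Cosine interpolation from `start` (first epoch) to `end` (last epoch);
    used here for lambda (0.7 -> 0.95) and tau (2.5 -> 2 on CIFAR-10)."""
    t = epoch / max(total_epochs - 1, 1)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))

def pgd_attack(model, x, y, eps=EPSILON, alpha=PGD_ALPHA, steps=PGD_STEPS):
    """Standard 10-step l_inf PGD used to craft training-time adversaries."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta = (x + delta).clamp(0, 1) - x   # keep x + delta a valid image
    return (x + delta).detach()
```

The quoted WA decay rate γ = 0.995 is consistent with an exponential-moving-average reading of weight averaging. The short sketch below (reusing `model` from the sketch above) expresses that reading with `torch.optim.swa_utils.AveragedModel` and a custom averaging function; this pairing is an assumption about the setup, not the paper's implementation.

```python
# Hedged reading of the WA setting (decay rate gamma = 0.995) as an EMA of
# model weights; the use of AveragedModel here is an assumption.
from torch.optim.swa_utils import AveragedModel

def ema_avg(avg_param, param, num_averaged, gamma=0.995):
    return gamma * avg_param + (1.0 - gamma) * param

wa_model = AveragedModel(model, avg_fn=ema_avg)
# After each optimizer step: wa_model.update_parameters(model)
```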