Understanding The Robustness in Vision Transformers
Authors: Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, Jose M. Alvarez
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1K and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. |
| Researcher Affiliation | Collaboration | 1 National University of Singapore, 2 NVIDIA, 3 The University of Hong Kong, 4 ASU, 5 Caltech, 6 ByteDance. Correspondence to: Zhiding Yu <zhidingy@nvidia.com>. |
| Pseudocode | No | The paper describes methods through mathematical formulations (e.g., Eqn. 1, 2, 3, 4, 5, 6, 7) and architectural diagrams (Figure 2, Figure 5) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code will be available at https://github.com/NVlabs/FAN. |
| Open Datasets | Yes | We verify the model robustness on ImageNet-C (IN-C), Cityscapes-C and COCO-C without extra corruption-related fine-tuning. For Cityscapes, we take the average mIoU for three severity levels for the noise category, following the practice in SegFormer (Xie et al., 2021). For all the rest of the datasets, we take the average of all five severity levels. ImageNet classification: for all the experiments and ablation studies, the models are pretrained on ImageNet-1K if not specified additionally. |
| Dataset Splits | Yes | Datasets and evaluation metrics. We verify the model robustness on ImageNet-C (IN-C), Cityscapes-C and COCO-C without extra corruption-related fine-tuning. The suffix -C denotes corrupted versions of the original dataset, generated in the same manner as proposed in (Hendrycks & Dietterich, 2019). For Cityscapes, we take the average mIoU for three severity levels for the noise category, following the practice in SegFormer (Xie et al., 2021). For all the rest of the datasets, we take the average of all five severity levels. For semantic segmentation and object detection, we load the ImageNet-1K pretrained weights and fine-tune on the Cityscapes and COCO clean image datasets. Then we directly evaluate the performance on Cityscapes-C and COCO-C. |
| Hardware Specification | No | The paper mentions 'GPU Setting' in its result tables, indicating that GPUs were used for experiments (e.g., 'GPU Setting (20M+)', 'GPU Setting (50M+)'), but it does not specify any particular GPU model (e.g., NVIDIA A100, RTX 3090), CPU model, or other hardware components. |
| Software Dependencies | No | The paper names the libraries used — "The evaluation code is based on timm library (Wightman, 2019). The codes are developed using MMSegmentation (Contributors, 2020) and MMDetection (Chen et al., 2019) toolbox." — but does not specify version numbers for any of them. |
| Experiment Setup | Yes | Specifically, we train FAN for 300 epochs using AdamW with a learning rate of 2e-3. We use 5 epochs to linearly warm up the model. We adopt a cosine decaying schedule afterward. We use a batch size of 2048 and a weight decay of 0.05. We adopt the same data augmentation schemes as (Touvron et al., 2021a), including Mixup, CutMix, RandAugment, and Random Erasing. We use Exponential Moving Average (EMA) to speed up the model convergence in a similar manner as the timm library (Wightman, 2019). |
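The severity-averaging protocol quoted in the Open Datasets and Dataset Splits rows (three severity levels for the noise category on Cityscapes-C, following SegFormer; all five levels otherwise) can be sketched as below. This is a minimal illustration, not code from the FAN repository; the `average_score` helper and the example score dictionary are hypothetical, while the corruption names follow the Hendrycks & Dietterich (2019) benchmark.

```python
# Corruptions in the "noise" category of the common-corruptions benchmark.
NOISE_CORRUPTIONS = {"gaussian_noise", "shot_noise", "impulse_noise"}

def average_score(corruption, scores_by_severity, dataset="cityscapes-c"):
    """Average a per-severity score (e.g. mIoU) over severity levels.

    Per the quoted protocol: Cityscapes-C noise corruptions are averaged
    over severities 1-3; everything else over all five severities.
    """
    if dataset == "cityscapes-c" and corruption in NOISE_CORRUPTIONS:
        levels = [1, 2, 3]
    else:
        levels = [1, 2, 3, 4, 5]
    vals = [scores_by_severity[s] for s in levels]
    return sum(vals) / len(vals)

# Illustrative per-severity mIoU values (not results from the paper):
scores = {1: 60.0, 2: 50.0, 3: 40.0, 4: 30.0, 5: 20.0}
noise_avg = average_score("gaussian_noise", scores)  # averages levels 1-3
other_avg = average_score("fog", scores)             # averages levels 1-5
```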
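The learning-rate schedule in the Experiment Setup row (5-epoch linear warmup into cosine decay, base LR 2e-3, 300 epochs total) can be written out as a small sketch. This is an interpretation of the quoted recipe, assuming a per-epoch schedule that decays to zero; `lr_at` is a hypothetical helper, not code from the FAN repository.

```python
import math

# Hyperparameters as reported: 300 epochs, 5-epoch warmup, base LR 2e-3.
EPOCHS = 300
WARMUP_EPOCHS = 5
BASE_LR = 2e-3

def lr_at(epoch):
    """Learning rate at a given epoch: linear warmup, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        # Linear warmup from BASE_LR/WARMUP_EPOCHS up to BASE_LR.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine decay from BASE_LR toward 0 over the remaining epochs.
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))
```

In PyTorch this shape is typically realized via `torch.optim.lr_scheduler.LambdaLR` wrapping an `AdamW` optimizer (with `weight_decay=0.05` as reported).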