Are Transformers more robust than CNNs?
Authors: Yutong Bai, Jieru Mei, Alan L. Yuille, Cihang Xie
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. While regarding generalization on out-of-distribution samples, we show pretraining on (external) large-scale datasets is not a fundamental request for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization is largely benefited by the Transformer's self-attention-like architectures per se, rather than by other training setups. |
| Researcher Affiliation | Academia | Yutong Bai¹, Jieru Mei¹, Alan Yuille¹, Cihang Xie² (¹Johns Hopkins University, ²University of California, Santa Cruz); {ytongbai, meijieru, alan.l.yuille, cihangxie306}@gmail.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs. |
| Open Datasets | Yes | We particularly focus on the comparisons between Small Data-efficient image Transformer (DeiT-S) [43] and ResNet-50 [16]... To train CNNs on ImageNet, we follow the standard recipe of [15, 33]. |
| Dataset Splits | Yes | In this section, we investigate the robustness of Transformers and CNNs on defending against adversarial attacks, using the ImageNet validation set (with 50,000 images). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions optimizers (e.g., momentum-SGD, AdamW) and data augmentation techniques but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | Specifically, we train all CNNs for a total of 100 epochs, using the momentum-SGD optimizer; we set the initial learning rate to 0.1, and decrease the learning rate by 10× at the 30th, 60th, and 90th epoch; no regularization except weight decay is applied. Specifically, we train all Transformers using the AdamW optimizer; we set the initial learning rate to 5e-4, and apply the cosine learning rate scheduler to decrease it; besides weight decay, we additionally adopt three data augmentation strategies (i.e., RandAug [9], MixUp [59] and CutMix [56]) to regularize training. |
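
The Experiment Setup row above describes the two training recipes only in prose. Below is a minimal sketch of how those hyperparameters might be wired up in PyTorch; the momentum value, the weight-decay values, and the omission of the RandAug/MixUp/CutMix data pipeline are assumptions for illustration, not details taken from the quoted text.

```python
# Minimal sketch of the two training recipes quoted above, assuming PyTorch.
# Momentum and weight-decay values are illustrative assumptions; the
# Transformer data-augmentation pipeline (RandAug/MixUp/CutMix) is not shown.
import torch
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR

EPOCHS = 100  # the paper trains CNNs for a total of 100 epochs


def cnn_recipe(model: torch.nn.Module):
    # CNN recipe: momentum-SGD, initial lr 0.1, 10x decay at epochs 30/60/90,
    # weight decay as the only regularizer.
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler


def transformer_recipe(model: torch.nn.Module):
    # Transformer recipe: AdamW, initial lr 5e-4, cosine learning-rate schedule.
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)
    return optimizer, scheduler
```

In a training loop built around either recipe, `scheduler.step()` would be called once per epoch after the optimizer updates, so the learning-rate trajectories match the step-decay and cosine schedules described in the paper.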