Are Transformers more robust than CNNs?
Authors: Yutong Bai, Jieru Mei, Alan L. Yuille, Cihang Xie
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. While regarding generalization on out-of-distribution samples, we show pretraining on (external) large-scale datasets is not a fundamental request for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization is largely benefited by the Transformer's self-attention-like architectures per se, rather than by other training setups. |
| Researcher Affiliation | Academia | Yutong Bai¹, Jieru Mei¹, Alan Yuille¹, Cihang Xie² (¹Johns Hopkins University, ²University of California, Santa Cruz); {ytongbai, meijieru, alan.l.yuille, cihangxie306}@gmail.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs. |
| Open Datasets | Yes | We particularly focus on the comparisons between Small Data-efficient image Transformer (DeiT-S) [43] and ResNet-50 [16]... To train CNNs on ImageNet, we follow the standard recipe of [15, 33]. |
| Dataset Splits | Yes | In this section, we investigate the robustness of Transformers and CNNs on defending against adversarial attacks, using the ImageNet validation set (with 50,000 images). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions optimizers (e.g., momentum-SGD, AdamW) and data augmentation techniques but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | Specifically, we train all CNNs for a total of 100 epochs, using the momentum-SGD optimizer; we set the initial learning rate to 0.1, and decrease the learning rate by 10× at the 30th, 60th, and 90th epoch; no regularization except weight decay is applied. Specifically, we train all Transformers using the AdamW optimizer; we set the initial learning rate to 5e-4, and apply the cosine learning rate scheduler to decrease it; besides weight decay, we additionally adopt three data augmentation strategies (i.e., RandAug [9], MixUp [59] and CutMix [56]) to regularize training. |
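
The Experiment Setup row above describes the two training recipes only in prose. Below is a minimal sketch of how those hyperparameters might be wired up in PyTorch; the momentum value, the weight-decay values, and the omission of the RandAug/MixUp/CutMix data pipeline are assumptions for illustration, not details taken from the quoted text.

```python
# Minimal sketch of the two training recipes quoted above, assuming PyTorch.
# Momentum and weight-decay values are illustrative assumptions; the
# Transformer data-augmentation pipeline (RandAug/MixUp/CutMix) is not shown.
import torch
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR

EPOCHS = 100  # the paper trains CNNs for a total of 100 epochs


def cnn_recipe(model: torch.nn.Module):
    # CNN recipe: momentum-SGD, initial lr 0.1, 10x decay at epochs 30/60/90,
    # weight decay as the only regularizer.
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler


def transformer_recipe(model: torch.nn.Module):
    # Transformer recipe: AdamW, initial lr 5e-4, cosine learning-rate schedule.
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)
    return optimizer, scheduler
```

In a training loop built around either recipe, `scheduler.step()` would be called once per epoch after the optimizer updates, so the learning-rate trajectories match the step-decay and cosine schedules described in the paper.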