Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Authors: Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs on ImageNet-based robustness benchmarks across 20+ different experimental settings.
Researcher Affiliation | Industry | Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang (Google Research)
Pseudocode | No | The paper describes algorithmic steps using mathematical formulas but does not include explicit pseudocode blocks or clearly labeled algorithm sections.
Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Supplementary Materials.
Open Datasets | Yes | We consider ViT models pretrained on either ILSVRC-2012 ImageNet-1k, with 1.3 million images, or ImageNet-21k, with 14 million images (Russakovsky et al., 2015). All models are fine-tuned on the ImageNet-1k dataset.
Dataset Splits | Yes | For the extra loss coefficient λ in Eqn. 1, we sweep it from the set {0.5, 1, 1.5} and choose the model with the best hold-out validation performance.
Hardware Specification | No | The paper's self-assessment checklist explicitly states: '(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.x').
Experiment Setup | Yes | Experimental setup: We follow Dosovitskiy et al. (2021) to first pre-train all the models with image size 224×224 and then fine-tune the models with a higher resolution of 384×384. We reuse all their training hyper-parameters, including batch size, weight decay, and training epochs (see Appendix B for details). For the extra loss coefficient λ in Eqn. 1, we sweep it from the set {0.5, 1, 1.5} and choose the model with the best hold-out validation performance. Please refer to Appendix C for the chosen hyperparameters for each model.
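The patch-based transformations referenced in the abstract above (operations that scramble the patch sequence a ViT consumes while leaving each patch intact) are simple to implement. The following is a minimal NumPy sketch of one such operation, patch shuffling; the function name and the 16-pixel patch size are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def patch_shuffle(image, patch_size=16, rng=None):
    """Shuffle the non-overlapping patches of an (H, W, C) image.

    Illustrative negative transformation: patch-level content is preserved,
    but the global arrangement (and hence the semantics) is destroyed.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size

    # Cut the image into a list of (gh * gw) patches of size patch_size x patch_size.
    patches = image[:gh * patch_size, :gw * patch_size]
    patches = patches.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size, patch_size, c)

    # Permute the patches and stitch them back into an image.
    patches = patches[rng.permutation(gh * gw)]
    shuffled = patches.reshape(gh, gw, patch_size, patch_size, c)
    shuffled = shuffled.transpose(0, 2, 1, 3, 4).reshape(gh * patch_size, gw * patch_size, c)
    return shuffled
```

The output of such a transformation is typically unrecognizable to a human, yet the paper reports that ViTs remain surprisingly insensitive to it, which is the observation motivating the negative-augmentation losses.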
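The experiment setup also mentions an extra loss coefficient λ in Eqn. 1, which weights a negative-augmentation term against the standard training objective. Below is a minimal sketch of how such a combined loss could be assembled, assuming PyTorch; the specific penalty (driving predictions on negative views toward the uniform distribution) is a hypothetical stand-in rather than the paper's exact regularizer, and `lam` corresponds to the λ swept over {0.5, 1, 1.5}.

```python
import torch
import torch.nn.functional as F

def training_loss(model, images, labels, negative_images, lam=1.0):
    """Standard cross-entropy plus a lam-weighted penalty on negative views.

    The penalty is a hypothetical stand-in: it pushes the model's predictions
    on patch-transformed (semantics-destroying) views toward the uniform
    distribution, discouraging reliance on features that survive such
    transformations. `lam` plays the role of the coefficient lambda.
    """
    ce = F.cross_entropy(model(images), labels)

    log_probs_neg = F.log_softmax(model(negative_images), dim=-1)
    num_classes = log_probs_neg.size(-1)
    uniform = torch.full_like(log_probs_neg, 1.0 / num_classes)
    penalty = F.kl_div(log_probs_neg, uniform, reduction="batchmean")

    return ce + lam * penalty
```

Per the setup above, each candidate λ is trained separately and the model with the best hold-out validation performance is kept.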