Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Authors: Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs on ImageNet-based robustness benchmarks across 20+ different experimental settings.
Researcher Affiliation | Industry | Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang (Google Research)
Pseudocode | No | The paper describes algorithmic steps using mathematical formulas but does not include explicit pseudocode blocks or clearly labeled algorithm sections.
Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Supplementary Materials.
Open Datasets | Yes | We consider ViT models pretrained on either ILSVRC-2012 ImageNet-1k, with 1.3 million images, or ImageNet-21k, with 14 million images (Russakovsky et al., 2015). All models are fine-tuned on the ImageNet-1k dataset.
Dataset Splits | Yes | For the extra loss coefficient λ in Eqn. 1, we sweep it from the set {0.5, 1, 1.5} and choose the model with the best hold-out validation performance.
Hardware Specification | No | The paper's self-assessment checklist explicitly states: '(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.x').
Experiment Setup | Yes | Experimental setup: We follow Dosovitskiy et al. (2021) to first pre-train all the models with image size 224×224 and then fine-tune the models with a higher resolution of 384×384. We reuse all their training hyper-parameters, including batch size, weight decay, and training epochs (see Appendix B for details). For the extra loss coefficient λ in Eqn. 1, we sweep it from the set {0.5, 1, 1.5} and choose the model with the best hold-out validation performance. Please refer to Appendix C for the chosen hyperparameters for each model.
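The patch-based transformations referenced in the abstract above (operations that scramble the patch sequence a ViT consumes while leaving each patch intact) are simple to implement. The following is a minimal NumPy sketch of one such operation, patch shuffling; the function name and the 16-pixel patch size are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def patch_shuffle(image, patch_size=16, rng=None):
    """Shuffle the non-overlapping patches of an (H, W, C) image.

    Illustrative negative transformation: patch-level content is preserved,
    but the global arrangement (and hence the semantics) is destroyed.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size

    # Cut the image into a list of (gh * gw) patches of size patch_size x patch_size.
    patches = image[:gh * patch_size, :gw * patch_size]
    patches = patches.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size, patch_size, c)

    # Permute the patches and stitch them back into an image.
    patches = patches[rng.permutation(gh * gw)]
    shuffled = patches.reshape(gh, gw, patch_size, patch_size, c)
    shuffled = shuffled.transpose(0, 2, 1, 3, 4).reshape(gh * patch_size, gw * patch_size, c)
    return shuffled
```

The output of such a transformation is typically unrecognizable to a human, yet the paper reports that ViTs remain surprisingly insensitive to it, which is the observation motivating the negative-augmentation losses.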
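The experiment setup also mentions an extra loss coefficient λ in Eqn. 1, which weights a negative-augmentation term against the standard training objective. Below is a minimal sketch of how such a combined loss could be assembled, assuming PyTorch; the specific penalty (driving predictions on negative views toward the uniform distribution) is a hypothetical stand-in rather than the paper's exact regularizer, and `lam` corresponds to the λ swept over {0.5, 1, 1.5}.

```python
import torch
import torch.nn.functional as F

def training_loss(model, images, labels, negative_images, lam=1.0):
    """Standard cross-entropy plus a lam-weighted penalty on negative views.

    The penalty is a hypothetical stand-in: it pushes the model's predictions
    on patch-transformed (semantics-destroying) views toward the uniform
    distribution, discouraging reliance on features that survive such
    transformations. `lam` plays the role of the coefficient lambda.
    """
    ce = F.cross_entropy(model(images), labels)

    log_probs_neg = F.log_softmax(model(negative_images), dim=-1)
    num_classes = log_probs_neg.size(-1)
    uniform = torch.full_like(log_probs_neg, 1.0 / num_classes)
    penalty = F.kl_div(log_probs_neg, uniform, reduction="batchmean")

    return ce + lam * penalty
```

Per the setup above, each candidate λ is trained separately and the model with the best hold-out validation performance is kept.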